Abstract

Discrete random probability measures and the exchangeable random partitions they induce
are key tools for addressing a variety of estimation and prediction problems in Bayesian
inference. Indeed, many popular nonparametric priors, such as the Dirichlet and the Pitman-Yor
process priors, select discrete probability distributions almost surely and, therefore,
automatically induce exchangeable random partitions. Here we focus on the family of Gibbs-type
priors, a recent and elegant generalization of the Dirichlet and the Pitman-Yor process priors.
These random probability measures share properties that are appealing both from a
theoretical and an applied point of view: (i) they admit an intuitive characterization in terms
of their predictive structure justifying their use in terms of a precise assumption on the
learning mechanism; (ii) they stand out in terms of mathematical tractability; (iii) they include
several interesting special cases besides the Dirichlet and the Pitman-Yor processes. The
goal of our paper is to provide a systematic and unified treatment of Gibbs-type priors and
highlight their implications for Bayesian nonparametric inference. We will deal with their
distributional properties, the resulting estimators, frequentist asymptotic validation and the
construction of time-dependent versions. Applications, mainly concerning hierarchical mixture
models and species sampling, will serve to convey the main ideas. The intuition inherent
to this class of priors and the neat results that can be deduced for it lead one to wonder
whether it actually represents the most natural generalization of the Dirichlet process.
prior; Pitman-Yor process; Mixture model; Population Genetics; Predictive distribution;
Species sampling. 

(c) 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all 
other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new
collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work 
in other works. Publisher version DOI: 10.1109/TPAMI.2013.217 





1 Introduction and preliminaries 


One of the main research lines within Bayesian Nonparametrics has been the proposal and study 
of classes of random probability measures whose laws act as nonparametric priors. Several such 
classes contain, as a special case, Ferguson’s Dirichlet process [22], which still represents the 
cornerstone of the field. A recent review that covers many of these models and uses completely 
random measures as a unifying concept can be found in [45]. When going beyond the Dirichlet 
process one typically has to face a trade-off between the desire for generality (which, as far as
inference is concerned, implies flexibility of the model) and tractability, both analytical and 
computational. Probably the most successful proposal is represented by the two-parameter 
Poisson-Dirichlet process introduced in [57] and further investigated in countless papers, most 
notably in [59, 63]. See [62] for a comprehensive review from a probabilistic perspective. Such
a process is also known as the Pitman-Yor (PY) process, especially in the Machine Learning
community, according to a terminology introduced in [33] which we will also adopt in the present
paper. For our purposes it is important to note that the PY process reduces to the Dirichlet
process by setting one of its parameters equal to 0. Nonetheless, some important distributional
features of the PY process are fundamentally different according to whether the value of such
a parameter is equal to 0 or not. A clear understanding of this aspect is possible by identifying
a large class of priors, which embeds the PY process as a special case. Such a class is given by 
Gibbs-type priors, introduced in [28] and only briefly addressed in the above mentioned review 
of nonparametric priors [45], thus motivating the main focus of this paper. In fact, by close 
inspection of the predictive structure they lead to, it will become apparent that the variety of 
distributional characteristics can be actually traced back to crucially different assumptions on 
the learning mechanism. This leads to a novel classification of discrete nonparametric priors 
which also serves as motivation for the use of Gibbs-type priors. Moreover, Gibbs-type priors
have the advantage of pinning down, in a neat way, the analytic tractability issue related to
general classes of nonparametric priors: in fact, they allow one to split the prediction rule in two
stages and to highlight the key quantity allowing simplification of the relevant expressions. Indeed,
throughout the following sections one can appreciate the beauty and simplicity of various
analytical results that admit straightforward application to statistical inference. Finally, it is to be
noted that Gibbs-type priors include other notable special cases of priors beyond the Dirichlet
and the PY processes: for example, normalized inverse Gaussian processes [40] and their
generalization given by normalized generalized gamma processes [43] as well as mixtures of symmetric
Dirichlet distributions [28]. Given this, can one state with confidence that Gibbs-type priors
are a natural generalization of the Dirichlet process, maybe the most natural?

The present paper aims at providing a survey on Gibbs-type priors that accounts for recent 
findings both in the probabilistic and statistical literature. This will serve as an important
opportunity for pointing out their analytical tractability, flexibility and suitability in a variety 
of inferential problems beyond current applications which include mixture models (see, e.g., 
[33, 43]), linguistics and information retrieval in document modeling ([73, 74]), species sampling 
([42, 44, 55]) and survival analysis [37], among others. 

1.1 Discrete random probability measures, exchangeable random partitions 
and predictive distributions 

We first lay out the basics of Bayesian inference in an exchangeable framework and focus on
some key concepts and tools. Suppose $(X_n)_{n\ge 1}$ is an (ideally) infinite sequence of observations,
with each $X_i$ taking values in some set $\mathbb{X}$. Moreover, $P_{\mathbb{X}}$ is the set of all probability
measures on $\mathbb{X}$. Assuming $(X_n)_{n\ge 1}$ to be exchangeable is equivalent to assuming the existence
of a probability distribution $Q$ on $P_{\mathbb{X}}$ such that

\[
\begin{aligned}
X_i \mid p &\stackrel{\text{iid}}{\sim} p, \qquad i = 1,\ldots,n \\
p &\sim Q
\end{aligned}
\tag{1}
\]

for any $n \ge 1$. Hence, $p$ is a random probability measure on $\mathbb{X}$ and its probability distribution $Q$,
also termed de Finetti measure, represents the prior distribution when (1) is used as a Bayesian
model with an observed sample $X_i$, $i = 1,\ldots,n$. Whenever $Q$ degenerates on a finite dimensional
subspace of $P_{\mathbb{X}}$, the inferential problem is usually called parametric. On the other hand, when
the support of $Q$ is infinite-dimensional then one typically speaks of a nonparametric inferential
problem and it is generally agreed (see, e.g., [23]) that having a large topological support is a
desirable property for a nonparametric prior. Given a sample $X_1,\ldots,X_n$ generated through
(1), the (one-step ahead) predictive distribution coincides with the posterior expected value of
$p$, that is

\[
P(X_{n+1} \in \cdot \mid X_1,\ldots,X_n) = \int_{P_{\mathbb{X}}} p(\cdot)\, Q(dp \mid X_1,\ldots,X_n),
\tag{2}
\]

where $Q(\cdot \mid X_1,\ldots,X_n)$ denotes the posterior distribution of $p$.

Discrete nonparametric priors, i.e. priors which select discrete distributions with probability 
1, play a key role in most Bayesian nonparametric procedures. It is well-known that the Dirichlet 
and the PY process priors share this property and the same can be said for the broader class of 
Gibbs-type priors. In fact, any random probability measure associated to a discrete prior can 
be represented as 

\[
p = \sum_{j=1}^{\infty} p_j\, \delta_{Z_j},
\tag{3}
\]

where $\delta_c$ stands for the unit point mass concentrated at $c$, $(p_j)_{j\ge 1}$ is a sequence of non-negative
random variables such that $\sum_{j\ge 1} p_j = 1$, almost surely, and $(Z_j)_{j\ge 1}$ is a sequence of $\mathbb{X}$-valued
random variables. Henceforth, we further assume that $(p_j)_{j\ge 1}$ and $(Z_j)_{j\ge 1}$ are independent and
that the $Z_j$'s are iid from a diffuse probability measure $P^*$ on $\mathbb{X}$ (or, in other terms, $P(Z_i \ne Z_j) = 1$
for any $i \ne j$). Such a general subclass of discrete random probability measures has been called
species sampling models by Pitman [59], a terminology that will be clarified in the following 
section. 

As far as the observables $X_i$'s are concerned, the discrete nature of $Q$ implies that any sample
$X_1,\ldots,X_n$ will feature ties with positive probability, therefore generating $K_n = k \le n$ distinct
observations $X_1^*,\ldots,X_k^*$ with frequencies $n_1,\ldots,n_k$ such that $\sum_{i=1}^k n_i = n$. When choosing and
analyzing specific predictive structures, the key quantity to consider, from both a conceptual and
a mathematical point of view, is the probability of observing a new distinct value not included
in the sample $X_1,\ldots,X_n$, namely

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n),
\tag{4}
\]

which will appear throughout the paper. To be more concrete consider the Dirichlet and the
PY processes. In the Dirichlet case, with parameters given by $P^*$ and $\theta > 0$, one has

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = \frac{\theta}{\theta + n}.
\]

In the PY case, in addition to $P^*$, one has two parameters $(\sigma, \theta)$ whose admissible values are
$\sigma \in [0,1)$ with $\theta > -\sigma$ or $\sigma < 0$ with $\theta = m|\sigma|$ for some positive integer $m$. One then has

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = \frac{\theta + \sigma k}{\theta + n},
\]

from which it is apparent that the corresponding probability for the Dirichlet process is recovered
by setting $\sigma = 0$.
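As a quick numerical illustration of the two displays above, the following minimal sketch evaluates the probability of a new value under the Dirichlet and the PY processes. The function names and the parameter checks are ours and are introduced only for illustration; the PY function handles only the case $\sigma \in [0,1)$ with $\theta > -\sigma$.

```python
def prob_new_dirichlet(theta: float, n: int) -> float:
    """Probability that X_{n+1} is new under a Dirichlet process: theta / (theta + n)."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return theta / (theta + n)


def prob_new_pitman_yor(sigma: float, theta: float, n: int, k: int) -> float:
    """Probability that X_{n+1} is new under a PY process: (theta + sigma*k) / (theta + n).

    Illustrative assumption: only sigma in [0, 1) with theta > -sigma is handled.
    """
    if not (0 <= sigma < 1 and theta > -sigma):
        raise ValueError("need sigma in [0, 1) and theta > -sigma")
    if not (1 <= k <= n):
        raise ValueError("need 1 <= k <= n")
    return (theta + sigma * k) / (theta + n)


if __name__ == "__main__":
    n, theta = 100, 1.0
    for k in (5, 20, 50):
        # The Dirichlet value ignores k, the PY value grows with k when sigma > 0.
        print(k, prob_new_dirichlet(theta, n), prob_new_pitman_yor(0.5, theta, n, k))
```

With $\sigma = 0$ the two functions agree for every $k$, which mirrors the reduction of the PY process to the Dirichlet process.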

Within such a framework, discrete random probability measures can be characterized in 
terms of the exchangeable random partition they imply, another key aspect of the paper for 
which we provide some essential background. Given the discreteness of $Q$, $p$ induces a partition of
$X_1,\ldots,X_n$ that is well described by means of an extremely useful tool, namely the exchangeable
partition probability function (EPPF) [59] given by

\[
p_k^{(n)}(n_1,\ldots,n_k) = \int_{\mathbb{X}^k} E\big(p^{n_1}(dx_1) \cdots p^{n_k}(dx_k)\big).
\tag{5}
\]

It is also easy to interpret: it essentially corresponds to the probability, induced by
$p$, of observing a sample of size $n$, $X_1,\ldots,X_n$, exhibiting $K_n = k$ distinct observations with
frequencies $n_1,\ldots,n_k$ or, equivalently, a specific partition into $K_n = k$ clusters with frequencies
$n_1,\ldots,n_k$. See [59, 62] for details. Note also that an EPPF satisfies the addition rule

\[
p_k^{(n)}(n_1,\ldots,n_k) = p_{k+1}^{(n+1)}(n_1,\ldots,n_k,1) + \sum_{j=1}^{k} p_k^{(n+1)}(n_1,\ldots,n_j+1,\ldots,n_k).
\tag{6}
\]


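For both Dirichlet and PY processes, the EPPF is available in closed form. In the former case
it is given by

\[
p_k^{(n)}(n_1,\ldots,n_k) = \frac{\theta^k}{(\theta)_n} \prod_{i=1}^{k} (n_i - 1)!,
\tag{7}
\]

where $(\theta)_n = \theta(\theta+1)\cdots(\theta+n-1)$ for any $n \ge 1$. For the PY process, it coincides with

\[
p_k^{(n)}(n_1,\ldots,n_k) = \frac{\prod_{i=1}^{k-1}(\theta + i\sigma)}{(\theta+1)_{n-1}} \prod_{i=1}^{k} (1-\sigma)_{n_i-1}.
\tag{8}
\]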
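The closed forms (7) and (8) are easy to evaluate numerically. The sketch below is an illustration under our own naming, not code from the literature: it computes both EPPFs and the right hand side of the addition rule (6), using the convention $(x)_0 = 1$ for the ascending factorial.

```python
from math import factorial, isclose, prod


def rising(x, n):
    """Ascending factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = 1.0
    for i in range(n):
        out *= x + i
    return out


def eppf_dirichlet(freqs, theta):
    """EPPF (7) of the Dirichlet process for cluster frequencies n_1, ..., n_k."""
    n, k = sum(freqs), len(freqs)
    return theta**k / rising(theta, n) * prod(factorial(ni - 1) for ni in freqs)


def eppf_pitman_yor(freqs, sigma, theta):
    """EPPF (8) of the PY process for cluster frequencies n_1, ..., n_k."""
    n, k = sum(freqs), len(freqs)
    numerator = prod(theta + i * sigma for i in range(1, k))
    return (numerator / rising(theta + 1.0, n - 1)
            * prod(rising(1.0 - sigma, ni - 1) for ni in freqs))


def addition_rule_rhs(eppf, freqs):
    """Right hand side of (6): one new singleton cluster, or one old cluster grows by 1."""
    freqs = list(freqs)
    new_cluster = eppf(freqs + [1])
    grown = sum(eppf(freqs[:j] + [freqs[j] + 1] + freqs[j + 1:]) for j in range(len(freqs)))
    return new_cluster + grown


if __name__ == "__main__":
    py = lambda f: eppf_pitman_yor(f, 0.5, 1.0)
    print(isclose(addition_rule_rhs(py, [3, 1, 2]), py([3, 1, 2]), rel_tol=1e-9))
```

Passing the EPPF as a function keeps `addition_rule_rhs` agnostic about the prior, so the same check applies to any candidate EPPF.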


The identification of the EPPF leads to the direct determination of the predictive distribution in
(2). Indeed, if $X_1,\ldots,X_n$ is a sample featuring $k \le n$ distinct values with respective frequencies
$n_1,\ldots,n_k$, one has

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = \frac{p_{k+1}^{(n+1)}(n_1,\ldots,n_k,1)}{p_k^{(n)}(n_1,\ldots,n_k)}
\tag{9}
\]

and the predictive distribution in (2) is a linear combination of $P^*(\cdot) = E(p(\cdot))$, which can be
interpreted as the prior guess at the shape of $p$, and of a weighted measure of the observations,
namely

\[
P(X_{n+1} \in \cdot \mid X_1,\ldots,X_n) =
\frac{p_{k+1}^{(n+1)}(n_1,\ldots,n_k,1)}{p_k^{(n)}(n_1,\ldots,n_k)}\, P^*(\cdot)
+ \sum_{j=1}^{k} \frac{p_k^{(n+1)}(n_1,\ldots,n_j+1,\ldots,n_k)}{p_k^{(n)}(n_1,\ldots,n_k)}\, \delta_{X_j^*}(\cdot).
\tag{10}
\]

Note that the right hand side of (10) is guaranteed to sum up to 1 if evaluated over the whole
space $\mathbb{X}$ by the addition rule (6). In the PY process case the predictive distribution takes on
the form

\[
P(X_{n+1} \in \cdot \mid X_1,\ldots,X_n) = \frac{\theta + \sigma k}{\theta + n}\, P^*(\cdot)
+ \frac{1}{\theta + n} \sum_{i=1}^{k} (n_i - \sigma)\, \delta_{X_i^*}(\cdot)
\tag{11}
\]

which, for $\sigma = 0$, reduces to the well-known Dirichlet process predictive structure given by a
linear combination of $P^*$ and the empirical measure.
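The predictive distribution (11) can be turned directly into a sequential sampling scheme. The following is a minimal sketch under stated assumptions: it covers only $\sigma \in [0,1)$ with $\theta > -\sigma$, and `base_sampler` is a hypothetical user-supplied callable standing in for draws from the diffuse base measure $P^*$.

```python
import random


def sample_py_sequence(n, sigma, theta, base_sampler, seed=None):
    """Draw X_1, ..., X_n sequentially from the PY predictive distribution (11).

    Assumes sigma in [0, 1) and theta > -sigma. `base_sampler(rng)` is a
    hypothetical callable returning one draw from the base measure P*.
    Returns the sample, the distinct values and their frequencies.
    """
    if not (0 <= sigma < 1 and theta > -sigma):
        raise ValueError("need sigma in [0, 1) and theta > -sigma")
    rng = random.Random(seed)
    sample, distinct, freqs = [], [], []
    for i in range(n):  # i observations have been generated so far
        k = len(distinct)
        # Probability of a new value: 1 for the first draw, else (theta + sigma*k)/(theta + i).
        p_new = 1.0 if i == 0 else (theta + sigma * k) / (theta + i)
        if rng.random() < p_new:
            distinct.append(base_sampler(rng))
            freqs.append(1)
            sample.append(distinct[-1])
        else:
            # An old value j is chosen with probability proportional to n_j - sigma.
            j = rng.choices(range(k), weights=[nj - sigma for nj in freqs])[0]
            freqs[j] += 1
            sample.append(distinct[j])
    return sample, distinct, freqs


if __name__ == "__main__":
    # Illustrative choice: P* is the standard normal distribution.
    _, distinct, freqs = sample_py_sequence(500, 0.5, 1.0, lambda rng: rng.gauss(0.0, 1.0), seed=1)
    print(len(distinct), sorted(freqs, reverse=True)[:5])
```

Setting `sigma=0.0` gives the Dirichlet process urn, in which old values are re-drawn in proportion to their frequencies.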
1.2 Applications to species sampling problems and mixture modeling

Discrete nonparametric priors in general, and Gibbs-type priors in particular, are suited for 
addressing inferential issues that arise in species sampling problems and in mixture modeling, 
among others. We now briefly sketch these frameworks. 

Consider a discrete random probability measure (3) with the specifications as in Section 1.1. 
It is, then, apparent that (3) can be seen as a tool for describing the structure of a population 
made of different types or species with certain proportions, which are modeled through (3) as 
random proportions $p_j$. On the basis of this fact, one can equivalently use the $Z_j$'s or the positive
integers $\{1,2,\ldots\}$ to label different species or types that can be sampled. Indeed, if $(\xi_n)_{n\ge 1}$ is
an auxiliary integer-valued sequence such that $P(\xi_n = j \mid p) = p_j$, for any $n$ and $j$, model (1)
corresponds to assuming that $X_i = Z_{\xi_i}$. Hence the $X_n$'s can be interpreted as the observed
species labels since, due to the diffuse nature of $P^*$, any two data points $X_i$ and $X_j$, for $i \ne j$,
differ if and only if $\xi_i$ and $\xi_j$ do. Moreover, one has that $P(\xi_i = \xi_j) > 0$, for any $i \ne j$, and
this entails that the $i$-th and the $j$-th observations may reveal the same species with positive
probability. It is precisely this connection which motivates the terminology adopted in [60],
species sampling model. Moreover, an exchangeable sequence $(X_n)_{n\ge 1}$ for which (1) holds true,
with $p$ a species sampling model, takes on the name of species sampling sequence.

By virtue of this interpretation, there are a number of statistical problems one can face
adopting a Bayesian nonparametric perspective. Indeed, in many statistical applications one typically
observes a sample of species labels $X_1,\ldots,X_n$ and designs further sampling $X_{n+1},\ldots,X_{n+m}$
on the basis of estimates of some quantities of interest such as, e.g.: the number of new distinct
species that will be detected in a new sample of size $m$; the number of species with a given
frequency, or with frequency below a certain threshold, in $X_1,\ldots,X_{n+m}$; the probability that
the $(n+m+1)$-th draw will consist of a species having frequency $l \ge 0$ in $X_1,\ldots,X_{n+m}$. These,
in turn, provide measures of overall and rare species diversity and are of interest in biological,
ecological or linguistic studies, just to mention a few. In this respect, the predictive approach
briefly sketched in Section 1.1 plays an important role and provides nice and elegant answers to
these problems in the framework of Gibbs-type priors.

Discrete nonparametric priors are also basic building blocks for hierarchical mixture models 
that are typically used for density estimation and clustering but also in more complex dependent 
structures. To keep things simple consider the univariate density estimation case and let $f(\cdot \mid \cdot)$
denote a kernel defined on $\mathbb{R} \times \mathbb{X}$ and taking values in $\mathbb{R}^+$ such that $\int_{\mathbb{R}} f(y \mid x)\, dy = 1$, for any $x$
in $\mathbb{X}$. Hence, $f(\cdot \mid x)$ defines a density function on $\mathbb{R}$, for any $x$. The observations are then from
a sequence $(Y_n)_{n\ge 1}$ of real-valued random variables such that

\[
\begin{aligned}
Y_i \mid X_i &\stackrel{\text{ind}}{\sim} f(\cdot \mid X_i), \qquad i = 1,\ldots,n \\
X_i \mid p &\stackrel{\text{iid}}{\sim} p, \qquad i = 1,\ldots,n \\
p &\sim Q.
\end{aligned}
\tag{12}
\]

The typical choice for $p$ is represented by the Dirichlet process leading to the Dirichlet process
mixture model introduced by Lo [49], which represents the most popular Bayesian nonparametric
model to date. In addition to density estimation, such a model also serves clustering purposes.
In fact, here $(X_n)_{n\ge 1}$ is a sequence of latent exchangeable random elements and the unobserved
number $K_n$ of distinct values among $X_1,\ldots,X_n$ is the number of clusters into which the
observations $Y_1,\ldots,Y_n$ can be grouped. Posterior inferences for $K_n$ are of great importance and the
specification of a Gibbs-type prior $p$ in (12) allows for an effective detection of the number of
clusters that have generated the data.
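To make the hierarchy (12) concrete, the sketch below simulates data from a Dirichlet process mixture. Everything specific in it is an assumption made for illustration: a Gaussian kernel $f(y \mid x) = N(y; x, s^2)$, a Gaussian base measure $P^*$, and the names `sample_dp_mixture`, `base_sd` and `kernel_sd`. The latent $X_i$'s are generated through the Dirichlet process predictive rule, i.e. (11) with $\sigma = 0$.

```python
import random


def sample_dp_mixture(n, theta, base_sd=3.0, kernel_sd=0.5, seed=None):
    """Simulate (Y_1, ..., Y_n) and the latent (X_1, ..., X_n) from model (12).

    Illustrative assumptions: p is a Dirichlet process with total mass theta and
    base measure P* = N(0, base_sd^2); the kernel is f(y | x) = N(y; x, kernel_sd^2).
    """
    rng = random.Random(seed)
    latent, ys = [], []
    for i in range(n):
        # Urn scheme: a new value with probability theta / (theta + i),
        # otherwise copy a uniformly chosen past X (probability n_j / (theta + i) each).
        if rng.random() < theta / (theta + i):
            x = rng.gauss(0.0, base_sd)
        else:
            x = rng.choice(latent)
        latent.append(x)
        ys.append(rng.gauss(x, kernel_sd))
    return ys, latent


if __name__ == "__main__":
    ys, latent = sample_dp_mixture(200, theta=1.0, seed=0)
    print("number of clusters K_n:", len(set(latent)))
```

Here the number of clusters $K_n$ is read off as the number of distinct latent values; in an actual analysis the latent values are unobserved and $K_n$ has to be inferred from the $Y_i$'s.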

1.3 Outline of the paper 

Section 2 first provides an intuitive characterization of Gibbs-type priors based on a suitable 
classification of species sampling models. This is, then, followed by a formal definition and an 
overview of their distributional properties that are of interest for applications to Bayesian
inference. Particular emphasis is given to the role played by one of the parameters that characterizes
them. Section 3 discusses the use of Gibbs-type priors within hierarchical mixture models for 
density estimation and clustering. Section 4 focuses on the application of Gibbs-type priors to 
prediction problems and Section 5 deals with their frequentist asymptotic properties. Section 6 
concisely discusses extensions of Gibbs-type priors to dynamic contexts. Finally, Section 7 
contains some concluding remarks trying to answer the question posed in the title of the paper. 


2 Gibbs-type priors

An interesting and useful classification of species sampling models can be given in terms of the 
structure of the probability of generating a new value they induce. This leads to an intuitive 
characterization of Gibbs-type priors and represents also one of the main motivations for their 
use. Our result is somewhat in the spirit of Zabell's [76] characterization of the Dirichlet process in
terms of the so-called Johnson's sufficientness postulate. To this end, recall that the key quantity
is (4) representing the probability of generating a new value given the past associated to a
species sampling model as specified in Section 1.1. According to its structure one can classify the
underlying priors in three main categories. Denote by $\theta$ a finite-dimensional parameter possibly
entering the specification of $p$ in (1). In general, one has $P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) =
f(n, k, n_1,\ldots,n_k, \theta)$, which means that the probability of obtaining a new observation depends
on the sample size $n$, the number of distinct values $k$, their frequencies $(n_1,\ldots,n_k)$ and the
parameter $\theta$. We will denote $P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n)$ by $f(n, k, \theta)$ if it does not depend
on $(n_1,\ldots,n_k)$ and by $f(n, \theta)$ if it depends neither on $(n_1,\ldots,n_k)$ nor on $k$.

Proposition 1 Let $p$ be a species sampling model. Then the following classification in terms of
the structure of the probability of generating a new value holds:

(i) $P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = f(n, \theta)$ if and only if $p$ is a Dirichlet process;

(ii) $P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = f(n, k, \theta)$ if and only if $p$ is of Gibbs-type;

(iii) $P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = f(n, k, n_1,\ldots,n_k, \theta)$ otherwise.

Even if the Dirichlet process has proven to perform well in several applied contexts, from a
merely conceptual point of view it seems too restrictive to let the probability of generating new
values depend solely on the sample size $n$ and on its total mass parameter $\theta$ and to summarize
all other information contained in the data by means of a suitable specification of the scalar
parameter $\theta$. One would indeed like such a probability to depend explicitly also on (at least)
the number of distinct observed values, since it summarizes the heterogeneity in the sample. By
virtue of (ii), this is tantamount to resorting to a Gibbs-type prior. According to the specific
situation, one might want to model (4) as an increasing or decreasing function of $K_n$, which
will be shown to correspond to Gibbs-type priors with a parameter, to be identified later, being
either positive or negative, respectively. Case (iii), which corresponds to the most general setup
and in which the prediction of new values explicitly depends on all the information conveyed by
the data, is in principle the most desirable prediction structure. However, there are two main operational
problems that one needs to take into account. On the one hand, the general case (iii) gives rise to
serious analytical hurdles and priors have to be studied on a case-by-case basis typically leading
to quite complicated expressions (see [20]). On the other hand, it is not clear how one should
explicitly specify the dependence of the probability of observing a new species on the observed
frequencies $n_1,\ldots,n_k$ so that it reflects an opinion on the learning mechanism for the data. It
is thus reasonable that such prior opinion be encoded through the finite-dimensional parameter
$\theta$. Hence, the above classification neatly shows the origin of the mathematical tractability of
Gibbs-type priors, which is due to a precise simplifying assumption on the prediction structure.
Overall, such an assumption appears to be a satisfactory compromise between generality (or
flexibility) and tractability, and therefore motivates the attempt to study and understand the
behavior of such priors.

After having stated and discussed a predictive characterization of Gibbs-type priors, we now
provide a different, though equivalent, definition which is more useful when one wishes to analyze
their distributional properties. As seen in Section 1, a discrete nonparametric prior $p$ associated
to an exchangeable scheme of the type (1) can be characterized in terms of the associated EPPF
$\{p_k^{(n)} : n \ge 1,\ 1 \le k \le n\}$ defined as in (5). Accordingly, one defines a Gibbs-type prior as a
species sampling model such that

\[
p_k^{(n)}(n_1,\ldots,n_k) = V_{n,k} \prod_{i=1}^{k} (1-\sigma)_{n_i-1}
\tag{13}
\]

for any $n \ge 1$, $k \le n$ and positive integers $n_1,\ldots,n_k$ such that $\sum_{i=1}^k n_i = n$, where $\sigma < 1$
and the set of non-negative weights $\{V_{n,k} : n \ge 1,\ 1 \le k \le n\}$ satisfies the forward recursive
equation

\[
V_{n,k} = (n - \sigma k)\, V_{n+1,k} + V_{n+1,k+1}
\tag{14}
\]

for any $k = 1,\ldots,n$ and $n \ge 1$, with $V_{1,1} = 1$. In light of (13) one can rephrase the reason for
their tractability in more mathematical terms, namely the product form of their EPPFs, which
allows one to handle the frequencies $n_j$ conveniently. Given (13), the probability of obtaining a new
distinct observation conditional on a sample $X_1,\ldots,X_n$ such that $K_n = k$ is

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n) = \frac{V_{n+1,k+1}}{V_{n,k}} = f(n, k, \theta),
\]

which is in accordance with the above characterization.
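For completeness, the step from (13) to the probability of a new value, and the link between the addition rule (6) and the recursion (14), can be spelled out. The following is only a worked restatement of (9), (13) and (6), using the convention $(x)_0 = 1$.

```latex
\begin{align*}
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n)
  &= \frac{p_{k+1}^{(n+1)}(n_1,\ldots,n_k,1)}{p_k^{(n)}(n_1,\ldots,n_k)}
   = \frac{V_{n+1,k+1}\,(1-\sigma)_0 \prod_{i=1}^{k} (1-\sigma)_{n_i-1}}
          {V_{n,k} \prod_{i=1}^{k} (1-\sigma)_{n_i-1}}
   = \frac{V_{n+1,k+1}}{V_{n,k}}, \\[4pt]
\frac{p_k^{(n+1)}(n_1,\ldots,n_j+1,\ldots,n_k)}{p_k^{(n)}(n_1,\ldots,n_k)}
  &= \frac{V_{n+1,k}}{V_{n,k}} \cdot \frac{(1-\sigma)_{n_j}}{(1-\sigma)_{n_j-1}}
   = \frac{V_{n+1,k}}{V_{n,k}}\,(n_j - \sigma).
\end{align*}
```

Summing the second line over $j = 1,\ldots,k$ gives $(n - \sigma k) V_{n+1,k}/V_{n,k}$, so dividing the addition rule (6) by $p_k^{(n)}(n_1,\ldots,n_k)$ yields $1 = V_{n+1,k+1}/V_{n,k} + (n - \sigma k) V_{n+1,k}/V_{n,k}$, which is exactly (14).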


Remark 2 According to the classification implied by Proposition 1, mixtures of the Dirichlet
process, obtained by mixing with respect to the total mass $\theta$ of the base measure, are in class (ii).
To see this, let $\pi$ denote the prior on $\theta$ so that $\pi(d\theta \mid X_1,\ldots,X_n) \propto \theta^k \pi(d\theta)/(\theta)_n$, where $(\theta)_n$ is
the $n$-th ascending factorial. Hence

\[
P(X_{n+1} = \text{``new''} \mid X_1,\ldots,X_n)
= \int_{\mathbb{R}^+} \frac{\theta}{\theta + n}\, \pi(d\theta \mid X_1,\ldots,X_n)
= \frac{\int_{\mathbb{R}^+} \theta^{k+1}/(\theta)_{n+1}\, \pi(d\theta)}{\int_{\mathbb{R}^+} \theta^{k}/(\theta)_{n}\, \pi(d\theta)}
\]

will now depend on $k$. More generally, mixtures of Gibbs-type priors obtained by mixing with
respect to a possible parameter entering the definition of $V_{n,k}$ are still of Gibbs-type and, thus, still
lie in (ii). In contrast, Gibbs-type priors mixed with respect to $\sigma$ are such that $\pi(d\sigma \mid X_1,\ldots,X_n) \propto
V_{n,k} \prod_{i=1}^{k} (1-\sigma)_{n_i-1}\, \pi(d\sigma)$ for some prior $\pi$ on $\sigma$. This clearly implies that the resulting family
of species sampling models is in (iii), although one still preserves a Gibbs structure conditionally
on $\sigma$.
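The dependence on $k$ stated in Remark 2 can be checked numerically. The sketch below assumes, purely for illustration, a prior on the total mass with finite support (the lists `thetas` and `prior` are hypothetical inputs); the posterior weights are proportional to $\theta^k/(\theta)_n$ times the prior, and the probability of a new value is the posterior mean of $\theta/(\theta+n)$.

```python
from math import isclose


def rising(x, n):
    """Ascending factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = 1.0
    for i in range(n):
        out *= x + i
    return out


def prob_new_dp_mixture(n, k, thetas, prior):
    """P(X_{n+1} = new | data) for a Dirichlet process mixed over its total mass.

    `thetas` and `prior` describe a finitely supported prior on the total mass
    (an illustrative assumption). Only n and k enter, not the frequencies.
    """
    # Unnormalized posterior weights: theta^k / (theta)_n * prior(theta).
    post = [w * th**k / rising(th, n) for th, w in zip(thetas, prior)]
    total = sum(post)
    return sum(pw / total * th / (th + n) for th, pw in zip(thetas, post))


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, prob_new_dp_mixture(20, k, [0.5, 5.0], [0.5, 0.5]))
```

With a degenerate prior the result is $\theta/(\theta+n)$ whatever the value of $k$, while with a non-degenerate prior a larger $k$ shifts posterior mass towards larger values of the total mass and raises the probability of a new value.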


The definition (13) implies that the induced predictive distributions are

\[
P(X_{n+1} \in \cdot \mid X_1,\ldots,X_n) = \frac{V_{n+1,k+1}}{V_{n,k}}\, P^*(\cdot)
+ \frac{V_{n+1,k}}{V_{n,k}} \sum_{i=1}^{k} (n_i - \sigma)\, \delta_{X_i^*}(\cdot).
\tag{15}
\]

Hence, the predictive distribution is a convex linear combination of the prior guess $P^*$ at the
shape of $p$ and of the weighted empirical distribution $P_n = (n - k\sigma)^{-1} \sum_{i=1}^{k} (n_i - \sigma)\, \delta_{X_i^*}$. The
predictive structure (15) exhibits some appealing and intuitive features. In particular, the
mechanism for allocating the predictive mass among “new” and previously observed data can
be split into two stages. Given a sample $X_1,\ldots,X_n$, the first step consists in allocating the
mass between a new value $X_{k+1}^*$ sampled from $P^*$ and the set of observed values $\{X_1^*,\ldots,X_k^*\}$.
This first step depends only on $n$ and $k$ and not on the frequencies $n_1,\ldots,n_k$. The second step
is the following: conditionally on $X_{n+1}$ being a new value, it is sampled from the base measure
$P^*$, whereas if $X_{n+1}$ coincides with one of the previously observed values $X_i^*$, for $i = 1,\ldots,k$,
the coincidence probabilities are determined by the size $n_i$ of each cluster and by $\sigma$. Hence,
even if the frequencies $n_i$ do not affect the probability of allocating a predicted value between
“new” and “old”, they are explicitly involved conditional on the predicted value coinciding with a
previously observed one: the more often a past observation is detected, the higher the probability
of re-observing it. Also $\sigma$ plays an interesting role in weighting the empirical measure since,
for $\sigma > 0$, a reinforcement mechanism driven by $\sigma$ takes place. Indeed, one can see that the
ratio of the probabilities assigned to any pair $(X_i^*, X_j^*)$ is given by $(n_i - \sigma)/(n_j - \sigma)$. As
$\sigma \to 0$, the previous quantity reduces to the ratio of the sizes of the two clusters and therefore
the coincidence probability is proportional to the size of the cluster. On the other hand, if
$\sigma > 0$ and $n_i > n_j$, the ratio is an increasing function of $\sigma$. Hence, as $\sigma$ increases the mass is
reallocated from $X_j^*$ to $X_i^*$. This means that the sampling procedure tends to reinforce, among
the observed clusters, those having higher frequencies, which represents an appealing feature
in certain inferential contexts. See [43] for a discussion of such reinforcement mechanisms and
their use in mixture models. If $\sigma < 0$, the reinforcement mechanism works in the opposite way
in the sense that the coincidence probabilities are less than proportional to the cluster size.
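A small worked example may help to fix ideas; the cluster sizes $n_i = 5$ and $n_j = 1$ are chosen arbitrarily for illustration.

```latex
\[
\frac{n_i - \sigma}{n_j - \sigma}\bigg|_{n_i = 5,\; n_j = 1}
=
\begin{cases}
\dfrac{5 + 1}{1 + 1} = 3 & \sigma = -1, \\[8pt]
\dfrac{5}{1} = 5 & \sigma = 0, \\[8pt]
\dfrac{5 - 0.5}{1 - 0.5} = 9 & \sigma = 0.5.
\end{cases}
\]
```

Relative to the proportional allocation obtained for $\sigma = 0$, a positive $\sigma$ favors the larger cluster (ratio 9 instead of 5), whereas a negative $\sigma$ dampens its advantage (ratio 3 instead of 5).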

Besides influencing the balancedness of the partition of the exchangeable random elements
directed by a Gibbs-type prior, the parameter $\sigma$ also determines the rate at which the number
of clusters $K_n$ increases, as the sample size $n$ increases. As shown, e.g., in [61], if we introduce

\[
c_n(\sigma) =
\begin{cases}
1 & \sigma < 0 \\
\log n & \sigma = 0 \\
n^{\sigma} & \sigma \in (0,1)
\end{cases}
\]

for any $n \ge 1$, then

\[
\frac{K_n}{c_n(\sigma)} \stackrel{\text{a.s.}}{\longrightarrow} S_\sigma
\tag{16}
\]

as $n \to \infty$. The limiting random variable $S_\sigma$ is termed $\sigma$-diversity. See [62] for details. It is
worth noting that if $p$ is the Dirichlet process with parameter measure $\theta P^*$, the $\sigma$-diversity is
degenerate on the total mass $\theta > 0$ and $K_n \sim \theta \log n$, for $n$ large enough, almost surely. This
special case was pointed out in [39]. The larger $\sigma$, the faster the rate of increase of $K_n$ or, in
other terms, the more new values are generated. Clearly, the case where $\sigma < 0$ corresponds to
a model accommodating a finite number of distinct species in the population.
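The three growth regimes can be illustrated in the PY special case without any simulation. Since the PY probability of a new value, $(\theta + \sigma k)/(\theta + n)$, is linear in $k$, taking expectations gives the exact recursion $E[K_{i+1}] = E[K_i] + (\theta + \sigma E[K_i])/(\theta + i)$ with $E[K_1] = 1$. The sketch below, whose function name is ours, iterates this recursion; it concerns only the expected value of $K_n$ under the PY process and not the almost sure limit in (16) for general Gibbs-type priors.

```python
from math import log


def expected_clusters(n, sigma, theta):
    """E[K_n] under a PY(sigma, theta) process.

    Uses e_{i+1} = e_i + (theta + sigma * e_i) / (theta + i) with e_1 = 1, which is
    exact because the probability of a new value is linear in the number of clusters.
    """
    e = 1.0  # K_1 = 1: the first observation always opens a cluster
    for i in range(1, n):
        e += (theta + sigma * e) / (theta + i)
    return e


if __name__ == "__main__":
    n = 10_000
    # sigma < 0 with theta = m|sigma| (here m = 3): E[K_n] stays below m.
    print("sigma = -1 :", expected_clusters(n, -1.0, 3.0))
    # sigma = 0: E[K_n] grows like theta * log(n).
    print("sigma = 0  :", expected_clusters(n, 0.0, 1.0), "vs log n =", log(n))
    # sigma in (0, 1): E[K_n] grows like n**sigma.
    print("sigma = 0.5:", expected_clusters(n, 0.5, 1.0), "vs n**0.5 =", n**0.5)
```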

The combined effect of the reinforcement mechanism and the increase in the rate at which
new values are generated, both driven by $\sigma$, is best visualized by looking at the special case
of the PY process. By close inspection of its predictive distribution (11) one notes that a
new value, thus with frequency 1, entering the conditioning sample produces two effects: it is
assigned a mass proportional to $(1 - \sigma)$, instead of 1, in the empirical component of the predictive
and, correspondingly, a mass proportional to $\sigma$ is added to the probability of generating a new
value. Therefore, if $\sigma > 0$, new values are assigned a mass which is less than proportional to
their cluster size (that is 1) and the remaining mass is added to the probability of generating
a new value. The first phenomenon gives rise to the reinforcement mechanism described above:
if the new value is, then, re-observed it increases the associated mass by a quantity which is
now proportional to 1, and not less than proportional. The second effect implies that if $X_{n+1}$
is new, the probability of generating yet another new value, which overall still decreases as a
function of $n$, is increased by $\sigma/(\theta + n + 1)$. To sum up, the larger $\sigma$ the stronger is
the reinforcement mechanism and at the same time the higher is the probability of generating
a new value, which intuitively explains why one then obtains a growth rate of $n^{\sigma}$ for $K_n$. If
$\sigma < 0$ things work the other way round and one sees that each new generated value decreases
the probability of generating further new values, thus providing intuition for the fact that in the
end only a finite number of values will be generated. If $\sigma = 0$, which corresponds to the Dirichlet
process and mixtures of the Dirichlet process over the parameter $\theta$, everything is proportional
to the cluster sizes which do not alter the probability of generating new values. As for another
instance of a Gibbs-type prior, namely the normalized generalized gamma process that will be
discussed later, a mechanism analogous to the PY process with $\sigma \in (0,1)$ can be identified
though the proportionality constants that rescale the masses are different due to the difference
of the underlying $V_{n,k}$'s.
2.1 Connections between Gibbs-type priors and product partition models

There is also a close connection between Gibbs-type priors, and in particular the random
partitions they induce, and exchangeable product partition models. The latter were introduced by
[32] and further studied, among others, by [2, 65]. If $\Pi_n$ represents a random partition of the
set of integers $\{1,\ldots,n\}$, a product partition model corresponds to a probability distribution
for $\Pi_n$ represented as follows

\[
P(\Pi_n = \{S_1,\ldots,S_k\}) \propto \prod_{i=1}^{k} \rho(S_i)
\tag{17}
\]

where $\rho(\cdot)$ is termed cohesion function. Now, let $|S| = \mathrm{card}(S)$ and require the cohesion function
$\rho(\cdot)$ to depend only on the cardinality of the set $S$, that is $\rho(S_i) := \rho(|S_i|) = \rho(n_i)$. This is
a natural and reasonable choice for a cohesion function. Then the random partition in (17) is,
for any $n \ge k \ge 1$, the random partition induced by an exchangeable sequence if and only if
$\rho(n_i) = (1-\sigma)_{n_i-1}/n_i!$ for $i = 1,\ldots,k$ and $\sigma \in [-\infty, 1]$ with the proviso that $(1-\sigma)_{n_i-1} = 1$
when $\sigma = -\infty$ and that $\Pi_n$ reduces to the singleton partition when $\sigma = 1$. This is equivalent
to saying that $\Pi_n$ is of Gibbs-type. Such a statement follows immediately from [28]. Therefore,
random probability measures inducing exchangeable product partition models with cohesion
function depending on the cardinality, i.e.

\[
\begin{aligned}
X_i^* \mid \Pi_n &\stackrel{\text{iid}}{\sim} P^*, \qquad i = 1,\ldots,K_n \\
\Pi_n &\sim \text{product partition distribution with } \rho(S) = \rho(|S|),
\end{aligned}
\]

coincide with the family of Gibbs-type priors.


2.2 Sub-classes of Gibbs-type priors


Many nonparametric priors currently used for Bayesian inference represent particular cases of
Gibbs-type priors, such as the Dirichlet process and the PY family. Indeed, it can be verified
that the set of weights

\[
V_{n,k} = \frac{\prod_{i=1}^{k-1}(\theta + i\sigma)}{(\theta+1)_{n-1}}
\tag{18}
\]

satisfies the recursive equation (14) if the pair $(\sigma, \theta)$ is such that $\sigma \in [0,1)$ and $\theta > -\sigma$ or
$\sigma < 0$ and $\theta = m|\sigma|$ for some positive integer $m$. These constraints identify the set of admissible
values of the parameters $(\sigma, \theta)$. The corresponding Gibbs-type prior, identified by its EPPF (13),
reduces to (7) for $\sigma = 0$ therefore leading to the Dirichlet process. For any admissible $(\sigma, \theta)$ the
EPPF (13) coincides with (8), thus recovering the PY family. Another interesting special case
of the PY process, and a fortiori of Gibbs-type priors, is represented by the normalized $\sigma$-stable
process introduced by [38] which is obtained as a PY process with $\sigma \in (0,1)$ and $\theta = 0$.
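The claim that the PY weights satisfy the recursion (14) can be checked numerically on a grid of $(n, k)$. The sketch below is an illustration with helper names of our own choosing; it evaluates $V_{n,k} = \prod_{i=1}^{k-1}(\theta + i\sigma)/(\theta+1)_{n-1}$ and compares both sides of (14) in floating point, so it is a sanity check rather than a proof.

```python
from math import isclose, prod


def rising(x, n):
    """Ascending factorial (x)_n = x (x+1) ... (x+n-1), with (x)_0 = 1."""
    out = 1.0
    for i in range(n):
        out *= x + i
    return out


def py_weight(n, k, sigma, theta):
    """PY weight V_{n,k} = prod_{i=1}^{k-1} (theta + i*sigma) / (theta + 1)_{n-1}."""
    return prod(theta + i * sigma for i in range(1, k)) / rising(theta + 1.0, n - 1)


def satisfies_recursion(weight, sigma, n_max):
    """Check V_{n,k} = (n - sigma*k) V_{n+1,k} + V_{n+1,k+1} for 1 <= k <= n <= n_max."""
    for n in range(1, n_max + 1):
        for k in range(1, n + 1):
            lhs = weight(n, k)
            rhs = (n - sigma * k) * weight(n + 1, k) + weight(n + 1, k + 1)
            if not isclose(lhs, rhs, rel_tol=1e-9):
                return False
    return True


if __name__ == "__main__":
    for sigma, theta in [(0.0, 1.0), (0.5, 0.3), (-1.0, 3.0)]:
        ok = satisfies_recursion(lambda n, k: py_weight(n, k, sigma, theta), sigma, 8)
        print(sigma, theta, ok)
```

For $\sigma = -1$ and $\theta = 3$ the product in the numerator vanishes as soon as $k > 3$, in line with a population made of $m = 3$ species.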
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Before discussing other special cases of Gibbs-type priors, it is worth having a closer look at 
the PY family with cr < 0 and 9 = m|cr|. In this case the weights in (18) are as follows 






k{l,...,min(n,m)} (^0 


(19) 


(m\o\ + l)n —1 

where $\mathbb{1}_A$ denotes the indicator function of the set $A$. From (19) it is then easy to see (cf. [62])
that the PY family with $\sigma < 0$ and $\theta = m|\sigma|$ corresponds to a population composed of $m$
different species with proportions distributed according to a symmetric Dirichlet distribution
with density function

$$f(p_1, \ldots, p_{m-1}) = \frac{\Gamma(m|\sigma|)}{\Gamma(|\sigma|)^m} \prod_{i=1}^{m-1} p_i^{|\sigma|-1}\, (1 - p_1 - \cdots - p_{m-1})^{|\sigma|-1}$$

for any $(p_1, \ldots, p_{m-1})$ such that $p_i > 0$ for any $i$ and $\sum_{i=1}^{m-1} p_i < 1$. Such a model arises, in the
Population Genetics literature, as the stationary law of a Wright-Fisher model.
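
The PY weights in (18), including their finite-species form with $\sigma < 0$ and $\theta = m|\sigma|$, can be checked numerically. The following is a minimal sketch, assuming that (14) is the usual Gibbs-type recursion $V_{n,k} = (n - \sigma k)V_{n+1,k} + V_{n+1,k+1}$; the function names `rising`, `py_weight` and `recursion_gap` are illustrative and not part of any library.

```python
def rising(a, n):
    """Ascending factorial (a)_n = a (a+1) ... (a+n-1), with (a)_0 = 1."""
    out = 1.0
    for j in range(n):
        out *= a + j
    return out


def py_weight(n, k, sigma, theta):
    """Pitman-Yor weight V_{n,k} of (18); for sigma < 0 take theta = m * |sigma|."""
    num = 1.0
    for i in range(1, k):
        num *= theta + i * sigma
    return num / rising(theta + 1.0, n - 1)


def recursion_gap(n, k, sigma, theta):
    """V_{n,k} - [(n - sigma k) V_{n+1,k} + V_{n+1,k+1}]; zero if the recursion holds."""
    rhs = ((n - sigma * k) * py_weight(n + 1, k, sigma, theta)
           + py_weight(n + 1, k + 1, sigma, theta))
    return py_weight(n, k, sigma, theta) - rhs


if __name__ == "__main__":
    # Assumed illustrative parameter values, not taken from the text.
    for n in range(1, 6):
        print(n, [abs(recursion_gap(n, k, 0.3, 1.2)) < 1e-12 for k in range(1, n + 1)])
    # Finite-species case: sigma = -0.5 and m = 3 species, hence theta = 1.5.
    print(py_weight(10, 3, -0.5, 1.5), py_weight(10, 4, -0.5, 1.5))
```

In the finite-species case the factor $\theta + m\sigma$ vanishes, so the weights put no mass on partitions with more than $m$ clusters, in agreement with the indicator in (19).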

The PY family with parameters $(\sigma, m|\sigma|)$ and $\sigma < 0$ is not only a distinguished special case
of Gibbs-type prior with $\sigma < 0$ but actually is its basic building block. In fact, specifying a prior
$\pi$ for the total number of species $m$ in (19) yields a Gibbs-type random probability measure with
$\sigma < 0$, which coincides with a species sampling model having a random (finite)
number of species. Crucially, by [28], the reverse implication holds true as well: any Gibbs-type
prior with $\sigma < 0$ is a mixture of PY processes with parameters $(\sigma, m|\sigma|)$, the mixing measure
being a probability measure on the positive integers. Therefore, one can equivalently describe
Gibbs-type priors with $\sigma < 0$ in terms of a mixture model as

$$(\tilde p_1, \ldots, \tilde p_{\tilde m - 1}) \mid \tilde m \sim f_{\tilde m}, \qquad \tilde m \sim \pi, \qquad (20)$$

where $f_{\tilde m}$ is the symmetric Dirichlet density displayed above with $m = \tilde m$.
Interesting special cases arise by particular specifications of $\pi$. For instance, if

$$\pi(m) = \frac{\gamma\,(1-\gamma)_{m-1}}{m!} \qquad (21)$$

for $m = 1, 2, \ldots$ with $\gamma \in (0,1)$, one obtains the model introduced by Gnedin [27], which in the
case of $\sigma = -1$ admits a completely explicit expression of the weights, namely

$$V_{n,k} = \frac{(k-1)!\,(1-\gamma)_{k-1}\,(\gamma)_{n-k}}{(n-1)!\,(1+\gamma)_{n-1}}. \qquad (22)$$

The peculiar feature of such a model, which makes it of great use in applications, is that the
heavy-tailedness of (21) implies a model with a finite random number of species whose expected
value is infinite. Other interesting models are obtained by specifying the mixing distribution as
a Poisson distribution restricted to the positive integers with parameter $\lambda > 0$, i.e.

$$\pi(m) = \frac{e^{-\lambda}\,\lambda^m}{(1 - e^{-\lambda})\, m!} \qquad (23)$$

for $m = 1, 2, \ldots$, or as a geometric mixing distribution

$$\pi(m) = (1-\eta)\,\eta^{m-1} \qquad (24)$$

for some $\eta \in (0,1)$ and $m = 1, 2, \ldots$. These will be further discussed in Section 5.2.
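
The heavy tail of Gnedin's mixing distribution (21) and the explicit weights (22) can be explored numerically. The sketch below assumes again that the Gibbs-type recursion reads $V_{n,k} = (n - \sigma k)V_{n+1,k} + V_{n+1,k+1}$, which for $\sigma = -1$ becomes $V_{n,k} = (n + k)V_{n+1,k} + V_{n+1,k+1}$; the helper names and the value $\gamma = 0.5$ are illustrative choices.

```python
from math import factorial


def rising(a, n):
    """Ascending factorial (a)_n, with (a)_0 = 1."""
    out = 1.0
    for j in range(n):
        out *= a + j
    return out


def gnedin_pmf(gamma, m_max):
    """[pi(1), ..., pi(m_max)] for (21), computed through
    pi(1) = gamma and pi(m + 1) = pi(m) * (m - gamma) / (m + 1)."""
    probs = [gamma]
    for m in range(1, m_max):
        probs.append(probs[-1] * (m - gamma) / (m + 1))
    return probs


def partial_mean(probs):
    """Truncated mean: sum of m * pi(m) over the supplied terms."""
    return sum(m * p for m, p in enumerate(probs, start=1))


def gnedin_weight(n, k, gamma):
    """Explicit weight V_{n,k} of (22), valid for sigma = -1."""
    num = factorial(k - 1) * rising(1.0 - gamma, k - 1) * rising(gamma, n - k)
    den = factorial(n - 1) * rising(1.0 + gamma, n - 1)
    return num / den


if __name__ == "__main__":
    probs = gnedin_pmf(0.5, 100000)
    print(sum(probs))                    # approaches 1 slowly from below
    print(partial_mean(probs[:1000]))    # truncated means keep growing ...
    print(partial_mean(probs))           # ... as more terms are included
```

The truncated means grow without bound as more terms are added, which is the numerical counterpart of a finite number of species with infinite expectation.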

Another important sub-class of Gibbs-type priors is the normalized generalized Gamma
(NGG) process, which corresponds to

$$V_{n,k} = \frac{e^{\beta}\,\sigma^{k-1}}{\Gamma(n)} \sum_{i=0}^{n-1} \binom{n-1}{i} (-1)^i\, \beta^{i/\sigma}\; \Gamma\!\left(k - \frac{i}{\sigma};\, \beta\right) \qquad (25)$$

where $\sigma \in (0,1)$, $\beta > 0$ and $\Gamma(x; a) = \int_a^\infty s^{x-1} e^{-s}\, ds$ is the incomplete gamma function. The
NGG process also contains several interesting special cases: if $\sigma \to 0$ one obtains the Dirichlet
process, whereas $\sigma = 1/2$ yields the normalized inverse Gaussian process (N-IG) of [40], which
stands out for the availability of a closed form expression of its finite-dimensional distributions.
Furthermore, if $\beta = 0$, the normalized $\sigma$-stable process is also recovered from the NGG. See
[67, 35, 43]. The name attributed to this particular Gibbs-type prior is motivated by the
fact that it can be defined by normalizing a generalized gamma completely random measure
introduced in [3], and it therefore also belongs to the class of normalized random measures with
independent increments (NRMI) introduced in [67]. Interestingly, as shown in [48], it turns out
to be the only random probability measure belonging to both classes, NRMIs and Gibbs-type
priors. All other NRMIs, such as for instance the generalized Dirichlet process in [41, 20], are
not of Gibbs-type.

In addition to the specific examples described so far, and still for the case of $\sigma > 0$, one might
wonder whether, starting from the prediction rules (15), it is possible to identify the class of
random probability measures generating them. The answer is affirmative and, as shown in
[28], they coincide with the so-called $\sigma$-stable Poisson-Kingman models, which are obtained by
means of a particular transformation of $\sigma$-stable completely random measures. The technical
background needed for precisely defining such models goes beyond the scope of this review
and we refer the interested reader to [61, 28]. For our purposes it is enough to note that the
derivation of posterior quantities in this setting represents a challenging issue, which has not
found a satisfactory solution to date.

So far we have provided various motivations, of theoretical and practical relevance, for the
use of Gibbs-type priors, and the sub-classes discussed in this section provide a glimpse of
the nice and simple structure they share. Nonetheless, we still need to shed some light on
another distributional aspect which is important for assessing their suitability for nonparametric
inference, namely their support. As mentioned in Section 1, a large topological support is
a desirable property for a nonparametric prior, since the essence of being nonparametric can be
associated with assigning positive prior probability to as many “candidate models”
as possible. When considering the weak topological support, which is the most natural in this
framework, it can be shown (see [8]) that “genuinely nonparametric” Gibbs-type priors comply
with this requirement and have full weak support: in other terms, any weak neighborhood of any
distribution in $\mathbf{P}_{\mathbb{X}}$ will have a priori positive probability. Here by “genuinely nonparametric”
we mean Gibbs-type priors whose realizations are discrete distributions for which the number
of support points is not bounded. This essentially boils down to considering Gibbs-type priors
either with $\sigma \ge 0$ or with $\sigma < 0$ and unbounded support of the prior $\pi$ on the number of
components in (20). Such priors can be shown to possess the full weak support property, i.e.
their topological support coincides with the space of probability measures whose support is
included in the support of the prior guess $P^*$. In particular, if the support of $P^*$ coincides with
$\mathbb{X}$, the support of $Q$ is the whole space $\mathbf{P}_{\mathbb{X}}$.


3 Hierarchical mixture models based on Gibbs-type priors


As outlined in the Introduction, an important application of discrete random probability measures
and, hence, of Gibbs-type priors occurs within hierarchical mixture models of the type (12): this
corresponds to assuming exchangeable data $(Y_i)_{i \ge 1}$ from a random density defined by

$$\tilde f(y) = \int_{\mathbb{X}} f(y \mid x)\, \tilde p(dx). \qquad (26)$$

In particular, when $\tilde p$ follows a discrete prior $Q$, a key ingredient for prior and posterior inferences
is the corresponding EPPF. Indeed, given a set of observables $Y_1, \ldots, Y_n$ modeled according
to the above random density, the clustering structure among the latent variables $X_1, \ldots, X_n$
drives both the posterior distribution on the number of components and the posterior density
estimation. In particular,

$$P(K_n = k \mid Y_1, \ldots, Y_n) \propto \sum_{p_n \in \mathcal{P}_{n,k}} \Pi_k^{(n)}(n_1, \ldots, n_k) \prod_{j=1}^{k} \int_{\mathbb{X}} \prod_{i \in C_j} f(Y_i \mid x_j)\, P^*(dx_j),$$

where $\mathcal{P}_{n,k}$ is the set of all partitions $p_n$ of the $n$ latent variables into $k$ disjoint clusters and $C_j$
identifies the indices of those latent variables $X_i$ that belong to the $j$-th cluster in the partition
$p_n \in \mathcal{P}_{n,k}$. Therefore, the choice of $Q$ or, equivalently, of the corresponding EPPF, is crucial for
nonparametric Bayesian inferences in this framework, and it can be further appreciated through
the numerical illustrations provided later in this section.

An appealing feature of Gibbs-type priors is their ability to control the prior mass allocated
to different partitions through the reinforcement mechanism induced by the parameter $\sigma$ and
described in Section 2. This can be appreciated by looking at the induced (prior) distribution on
the number $K_n$ of clusters. First note that the determination of the distribution of $K_n$ follows
from a marginalization of (13) and leads to

$$P(K_n = k) = \frac{V_{n,k}}{\sigma^k}\, \mathscr{C}(n, k; \sigma) \qquad (27)$$

with

$$\mathscr{C}(n, k; \sigma) = \frac{1}{k!} \sum_{i=0}^{k} (-1)^i \binom{k}{i} (-i\sigma)_n$$

denoting a generalized factorial coefficient. See [5] for details on $\mathscr{C}(n, k; \sigma)$. Substituting
expressions (18) and (25) in (27) leads to the prior distributions on the number of different components
for the PY and the NGG processes, respectively. Letting $\sigma \to 0$ in either of the resulting
expressions one obtains the corresponding distribution for the Dirichlet process case

$$P(K_n = k) = \frac{\theta^k}{(\theta)_n}\, |s(n, k)|,$$

with $s(n,k)$ denoting the Stirling number of the first kind. See [5].

A graphical display of these distributions is best suited to highlight their differences. To this
end, fix $n = 50$ and consider the corresponding distributions of the number of components in
the three above cases. For the Dirichlet process it is well-known that the total mass parameter
$\theta$ controls the location of the distribution of $K_{50}$: larger values of $\theta$ lead to a right-shift of
the distribution implying an (a priori) larger number of components essentially affecting its
dispersion. In both the PY process and NGG cases the role of controlling the location is played
by $\theta$ and $\beta$, respectively. Hence, it is interesting to look at the additional parameter $\sigma$. Figure 1
concerns the NGG process and displays the distribution of $K_{50}$ for a fixed value of $\beta$ and with $\sigma$
ranging between 0.2 and 0.8. Note that in Figures 1, 2 and 3 the probability masses are connected
by straight lines only for visual simplification. From Figure 1 it is evident that the addition of
$\sigma$ allows one to control the flatness, or the variability, of the distribution of $K_{50}$, thus yielding a
higher degree of flexibility for the model. A similar behavior appears in the PY process. Hence,
replacing the Dirichlet process with a Gibbs-type prior characterized by a value of $\sigma$ in $(0,1)$
allows for a better control of the informativeness of the prior on the number of groups, since a larger
$\sigma$ flattens the prior. To better visualize this fact, it is useful to consider a simple comparative
example. In addition to $n = 50$, suppose that the prior expected number of clusters is 25. This
implies that a reasonable criterion for eliciting the parameters of a nonparametric prior is to
fix them in a way such that $E(K_{50}) = 25$. We compare five different models: Dirichlet process
with $\theta = 19.233$, PY processes with $(\sigma, \theta) = (0.25, 12.2157)$ and $(\sigma, \theta) = (0.73001, 1)$, and NGG
processes with $(\sigma, \beta) = (0.25, 48.4185)$ and $(\sigma, \beta) = (0.7353, 1)$, where all reported parameters
are chosen so that $E(K_{50}) = 25$. The corresponding distributions of $K_{50}$ are displayed in
Figure 2. Clearly, by increasing the value of $\sigma$ one obtains a less informative distribution on
$K_{50}$: when moving from $\sigma = 0$ to $\sigma \approx 0.73$ the distribution of $K_{50}$ becomes flatter, exhibiting
a larger variability. The Dirichlet process, instead, implies a highly peaked distribution of $K_{50}$,
which in terms of prior specification implies the need for reliable prior information on the
number of clusters, which is often unavailable. Furthermore, the PY and NGG processes have
a similar behavior, with the latter producing slightly lighter tails.

Figure 1: Prior distributions on the number of groups corresponding to the NGG process with
$n = 50$, $\beta = 1$ and $\sigma = 0.2, 0.3, \ldots, 0.7$ and $\sigma = 0.8$.



Figure 2: Prior distributions on the number of clusters corresponding to the Dirichlet (DP), the
Pitman-Yor (PY) and the normalized generalized gamma (NGG) processes. The values of the
parameters are set in such a way that $E(K_{50}) = 25$.
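
The elicitation criterion $E(K_{50}) = 25$ and the flattening effect of $\sigma$ can be reproduced for the Dirichlet and PY cases without evaluating generalized factorial coefficients. The sketch below is one possible route, not the computation used for the figures: it propagates the distribution of $K_n$ through the PY one-step rule, under which a new cluster is created with probability $(\theta + k\sigma)/(\theta + i)$ when $i$ observations have formed $k$ clusters ($\sigma = 0$ gives the Dirichlet process). The function names are illustrative; the NGG case is omitted because it requires incomplete gamma functions with negative first argument.

```python
def prior_kn_py(n, sigma, theta):
    """Return probs with probs[k] = P(K_n = k), k = 0..n, under PY(sigma, theta)."""
    probs = [0.0] * (n + 1)
    probs[1] = 1.0  # the first observation always opens the first cluster
    for i in range(1, n):  # move from sample size i to sample size i + 1
        new = [0.0] * (n + 1)
        for k in range(1, i + 1):
            p_new = (theta + k * sigma) / (theta + i)
            new[k + 1] += probs[k] * p_new
            new[k] += probs[k] * (1.0 - p_new)
        probs = new
    return probs


def mean_and_variance(probs):
    """Mean and variance of a distribution on {0, 1, ..., len(probs) - 1}."""
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    # Parameter values as reported in the text for the DP and PY models.
    for sigma, theta in [(0.0, 19.233), (0.25, 12.2157), (0.73001, 1.0)]:
        mean, var = mean_and_variance(prior_kn_py(50, sigma, theta))
        print(f"sigma={sigma}, theta={theta}: E(K_50)={mean:.3f}, Var(K_50)={var:.2f}")
```

With the mean held at 25, the printed variances give a rough numerical measure of how much less informative the prior on $K_{50}$ becomes as $\sigma$ grows.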


Let us now take a further step and compare the above five processes in a toy example to
have a closer look at the implication of such prior specifications on posterior inferences on the
clustering structure of the data. To this end, assume that $n = 50$ observations are drawn from
a uniform mixture of two well-separated Gaussian distributions, $N(1, 0.2)$ and $N(10, 0.2)$. From
a classification perspective these data clearly identify two groups. We model them with the
following nonparametric mixture model with standard specification

$$(Y_i \mid m_i, v_i) \sim N(m_i, v_i), \qquad i = 1, \ldots, n$$

$$(m_i, v_i \mid \tilde p) \sim \tilde p, \qquad i = 1, \ldots, n$$

$$\tilde p \sim Q$$

with $Q$ corresponding to the five processes above and $P^*(dm, dv) = N(m \mid \mu, \tau v^{-1})\, \mathrm{Ga}(v \mid 2, 1)\, dm\, dv$,
where $N(\,\cdot \mid a, b)$ denotes the Gaussian density with mean $a$ and variance $b > 0$
and $\mathrm{Ga}(\,\cdot \mid c, d)$ is the density corresponding to a Gamma distribution with mean $c/d$. A further
hierarchy is assumed for $\mu$ and $\tau$, i.e. $\mu \sim N(0, 0.001)$ and $\tau^{-1} \sim \mathrm{Ga}(1, 100)$. In this setup
the parameter specification for the five processes (chosen so that $E(K_{50}) = 25$) corresponds
to a prior opinion on $K_{50}$ remarkably far from the true number of components in the mixture
density that has generated the data. Given such a wrong prior specification one then wonders
whether the models possess enough flexibility to shift a posteriori towards the correct number of
components, namely 2. The results are based on 100,000 iterations after 5,000 burn-in iterations, adopting
a standard marginal MCMC algorithm with acceleration step. See [12, 51] for further details on
this algorithm.

Figure 3 depicts the posterior distribution on the number of mixture components. The most
important thing to note is that a larger $\sigma$ leads to better posterior estimates. Both the PY
and NGG processes with $\sigma \approx 0.73$ have been able to shift most of the mass towards a very
low number of components, with the PY process exhibiting a slightly better performance. See
also Table 1 for the numerical values of the posterior probabilities associated to the
possible values of $K_{50}$: the posterior probability of two components is 0.4630 under PY(0.73001, 1)
and 0.2143 under NGG(0.7353, 1), but only 0.0011 under the Dirichlet process. This shows how a
stronger reinforcement mechanism, implying a flatter
distribution of $K_n$, allows one to recover more effectively the correct number of components. In
contrast, the Dirichlet process is stuck around 10 components, since the high peakedness of its
prior on $K_n$ prevents it from overruling completely the wrong prior information.

Finally, it is important to point out that the above considerations concerning the advantages
of the additional parameter $\sigma$ hold beyond the present toy example, since they represent
structural properties of the models, which are by now well-understood thanks to several analytical
results and computational analyses. See, e.g., [43]. As far as the estimates of the density $\tilde f$ in
(26) are concerned, these are displayed in Figure 4. Even if the considerable heterogeneity in
the posterior inferences on the number of components is not reflected by density estimates, one
can still appreciate a slightly better performance of the NGG and PY processes with $\sigma \approx 0.73$,
since they show a closer adherence to the depicted true density.


Figure 3: Posterior distributions on the number of components corresponding to mixtures of the
Dirichlet (DP), the Pitman-Yor (PY) and the normalized generalized gamma (NGG) processes
with $n = 50$ and parameters set so that $E(K_{50}) = 25$.


   k    DP(19.233)   NGG(0.25,48.4185)   PY(0.25,12.216)   NGG(0.7353,1)   PY(0.73001,1)
   1      0.0000          0.0000             0.0000            0.0000          0.0000
   2      0.0011          0.0039             0.0132            0.2143          0.4630
   3      0.0068          0.0209             0.0487            0.2854          0.3015
   4      0.0220          0.0514             0.0979            0.2263          0.1399
   5      0.0484          0.0894             0.1419            0.1360          0.0573
   6      0.0789          0.1229             0.1528            0.0713          0.0225
   7      0.1069          0.1412             0.1506            0.0361          0.0092
   8      0.1257          0.1368             0.1245            0.0163          0.0037
   9      0.1301          0.1187             0.0921            0.0079          0.0016
  10      0.1205          0.0976             0.0659            0.0035          0.0007
  11      0.1031          0.0746             0.0435            0.0017          0.0003
  12      0.0816          0.0516             0.0283            0.0007          0.0002
  13      0.0597          0.0353             0.0170            0.0003          0.0001
 ≥ 14     0.1151          0.0556             0.0237            0.0004          0.0001

Table 1: Posterior distributions on the number of components arising from mixtures of the
Dirichlet process (DP), the normalized generalized Gamma (NGG) process and the Pitman-Yor
(PY) process centered such that the prior expected value of the number of components is 25
with the sample size $n = 50$.


Figure 4: Density estimates corresponding to the 5 mixture models that have been considered.


4 Prediction in species sampling problems

As already mentioned in Section 1, Gibbs-type priors are a powerful tool for addressing
prediction and estimation in species sampling problems when observations are recorded from a
population composed of individuals belonging to different types or species. This situation
occurs in many applied research areas, including genetics, biology, ecology, economics and
linguistics. Hence, in this section we will think of the observations $X_i$ in (1) as species labels.
The sample data $X_1, \ldots, X_n$ one can rely on for inferential purposes yield the following pieces
of information: the number $K_n$ of distinct species in the sample; the observed species labels
$X_1^*, \ldots, X_{K_n}^*$; the frequencies $N_n = (N_{1,n}, \ldots, N_{K_n,n})$ associated to each of the observed species.
Note that the last quantity can be alternatively reformulated in terms of the frequency counts
$M_n = (M_{1,n}, \ldots, M_{n,n})$, where $M_{i,n}$ is the number of species that have appeared with frequency
$i$ in the observed sample. It is obvious that these vectors must satisfy the following constraints:

$$\sum_{i=1}^{K_n} N_{i,n} = n, \qquad \sum_{i=1}^{n} M_{i,n} = K_n, \qquad \sum_{i=1}^{n} i\, M_{i,n} = n.$$
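
The summaries $K_n$, $N_n$ and $M_n$ and the three constraints above are easily computed from raw species labels. The following minimal sketch uses a made-up list of labels; the function name `species_summaries` is illustrative.

```python
from collections import Counter


def species_summaries(labels):
    """Return (K_n, N_n, M_n) for a list of species labels.

    N_n maps each observed label to its frequency;
    M_n maps each frequency i to the number of species observed i times.
    """
    frequencies = Counter(labels)
    frequency_counts = Counter(frequencies.values())
    return len(frequencies), dict(frequencies), dict(frequency_counts)


if __name__ == "__main__":
    sample = ["a", "a", "b", "c", "c", "c", "d"]  # hypothetical labels
    k_n, n_n, m_n = species_summaries(sample)
    n = len(sample)
    print(k_n, n_n, m_n)
    # The three constraints stated in the text.
    print(sum(n_n.values()) == n,
          sum(m_n.values()) == k_n,
          sum(i * c for i, c in m_n.items()) == n)
```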

In such problems species labels are typically not of interest, and the data can be efficiently
summarized by either $N_n$ or $M_n$, namely the partition they form. Since the EPPF (5) can also
be seen as the partition distribution induced by a sample, it is natural to resort to the class of
priors which have the most general yet tractable partition distribution. This naturally leads one to
work with Gibbs-type priors, which are characterized by the product-form EPPF (13).

In this framework a novel Bayesian nonparametric methodology for deriving estimators of
quantities related to an additional unobserved sample $X_{n+1}, \ldots, X_{n+m}$ from $\tilde p$, conditional on
$X_1, \ldots, X_n$, has been proposed in [42] and [47]. An important applied problem is the estimation
of the so-called overall species variety, which can be measured by estimating the number
$K_m^{(n)} = K_{n+m} - K_n$ of “new” distinct species that will be observed in the additional sample.
A generalization has been recently derived in [18] and it corresponds to the estimator of the
so-called rare species variety

$$\hat M_m^{(n)}(\tau) = \sum_{i=1}^{\tau} \hat M_{i,m}^{(n)} = \sum_{i=1}^{\tau} E(M_{i,n+m} \mid X_1, \ldots, X_n) \qquad (28)$$

namely the number of distinct species with frequency less than or equal to a specific threshold
of abundance $\tau$ that will be detected in the additional sample of size $m$. Note that both the
estimator of $K_m^{(n)}$, denoted by $\hat K_m^{(n)}$, and $\hat M_m^{(n)}(\tau)$ can be thought of as global measures of overall
and rare species variety respectively, since they refer to the whole additional sample of
size $m$. On the other hand, one may also need the corresponding local measures, which can
be quantified in terms of the discovery probability at step $(n + m + 1)$ of the sampling process.
Bayesian estimators of the latter have been determined in [19]. More specifically, if $\Delta_{i,n+m}$
is the set including species labels that appear with frequency $i \ge 0$ in the enlarged sample
$X_1, \ldots, X_{n+m}$, one is interested in estimating

$$U_{n+m,i} = P(X_{n+m+1} \in \Delta_{i,n+m} \mid X_1, \ldots, X_n). \qquad (29)$$

An estimator will be obtained by averaging over all possible realizations of the unobserved
additional sample $X_{n+1}, \ldots, X_{n+m}$, conditional on the basic sample $X_1, \ldots, X_n$. Here $U_{n+m,0}$
stands for the probability of sampling a new species at step $(n + m + 1)$, whereas $\sum_{i=0}^{\tau} U_{n+m,i}$ stands for
the probability of sampling either a species not yet observed or one with frequency not exceeding $\tau$.
Such local estimates are relevant, for example, in determining the size $m$ of the additional sample
$X_{n+1}, \ldots, X_{n+m}$: a possible criterion consists in fixing $m$ equal to the maximum possible value
for which the estimated discovery probability of new or rare species is above a certain threshold
probability.

If the population is composed of a large number of unknown species (genes, agents, categories,
etc.) and the basic sample $X_1, \ldots, X_n$ displays only a small fraction of the species present in the
population, Gibbs-type priors with $\sigma \in [0,1)$ are particularly suited. An effective and popular
example is offered by the analysis of Expressed Sequence Tags [44] or Serial Analysis of Gene
Expression [31] data. Indeed, in these experiments either complementary DNA (cDNA) libraries
or messenger RNA (mRNA) populations are considered and typical goals consist in identifying
the genes they are composed of, the relative frequencies of such genes and also in comparing
libraries/populations in terms of diversity. Due to time and cost constraints only a small portion
of the whole library or population is typically sequenced and prediction is required to assess the
overall characteristics. A similar experimental framework takes place in biological applications
such as, for example, in the analysis of T-cell identification problems (see [71]). In this case
one can characterize the immunological status of an organism by estimating the number of
distinct clonotypes in a T-cell repertoire and the clonal size distribution, which is the frequency
of clonotypes with a certain clonal size. In contrast, if the population has a limited number of
species, a common situation in Ecology, Gibbs-type models with $\sigma < 0$ are more appropriate
[15]. In what follows, for brevity we will deal with the case $\sigma \in [0,1)$. This implies that, when
specializing the results to the PY process, one also has $\theta > -\sigma$. However, it is to be noted that
most of the displayed findings carry over to the case of $\sigma < 0$.

On this topic there exists a well-established frequentist literature. The most relevant
contributions typically draw inspiration from papers by I.J. Good ([29]) and I.J. Good and G.H. Toulmin
([30]). See, e.g., [53] and [54]. For example, the popular Turing estimator for the discovery
probability (displayed in [29] and credited to A. Turing) is

$$\check U_{n,i} = (i + 1)\, \frac{M_{i+1,n}}{n}. \qquad (30)$$

For $i = 0$ it provides an estimator for the probability that the $(n + 1)$-th observation is new.
Equivalently, $1 - \check U_{n,0}$ provides an estimator of the sample coverage, namely the proportion of
species observed in the sample, which is an important quantity in many applied frameworks.
Moreover, estimates of $K_m^{(n)}$ and of the discovery probability $U_{n+m,0}$ for any $m \ge 1$ have been
established in [30] and shall be henceforth termed Good-Toulmin estimators. They coincide
with

$$\check U_{n+m,0} = n^{-1} \sum_{i=1}^{\infty} (-1)^{i+1}\, i\, \lambda^{i-1} M_{i,n}, \qquad \check K_m^{(n)} = \sum_{i=1}^{\infty} (-1)^{i+1} \lambda^{i} M_{i,n} \qquad (31)$$

where $\lambda = m/n$. Due to the alternating sign of the sums, when $\lambda$ is large they can yield
inadmissible numerical values. This instability arises even for values of $m$ moderately large with
respect to $n$: typically $m$ greater than $n$ is enough for it to appear. An illustration is provided
in Section 4.1. On the other hand, we are not aware of frequentist estimators of the discovery
probabilities $U_{n+m,i}$ when both $m$ and $i$ are positive integers.
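
The instability of the Good-Toulmin estimators for $\lambda > 1$ is easy to reproduce. The sketch below implements (30) and (31) as read here, with the frequency counts passed as a dictionary mapping each frequency $i$ to $M_{i,n}$; the function names and the toy counts are illustrative assumptions and do not come from any data set in the text.

```python
def turing(counts, n, i=0):
    """Turing estimator (30): (i + 1) * M_{i+1,n} / n."""
    return (i + 1) * counts.get(i + 1, 0) / n


def good_toulmin_new_species(counts, n, m):
    """Good-Toulmin estimator of the number of new species in m further draws."""
    lam = m / n
    return sum((-1) ** (i + 1) * lam ** i * c for i, c in counts.items())


def good_toulmin_discovery(counts, n, m):
    """Good-Toulmin estimator of the probability of a new species at step n + m + 1."""
    lam = m / n
    return sum((-1) ** (i + 1) * i * lam ** (i - 1) * c for i, c in counts.items()) / n


if __name__ == "__main__":
    counts = {1: 10, 2: 8, 3: 1}                  # hypothetical M_{1,n}, M_{2,n}, M_{3,n}
    n = sum(i * c for i, c in counts.items())     # sample size implied by the counts
    for m in (0, n, 2 * n):
        print(m, good_toulmin_new_species(counts, n, m), good_toulmin_discovery(counts, n, m))
```

With these counts both estimators are admissible at $m = n$ but turn negative at $m = 2n$, which is the kind of inadmissible value mentioned above.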


4.1 Bayesian inference on overall species variety 


Based on the EPPF (13), an explicit expression for the distribution of the number of “new”
distinct species observed in the additional sample, $K_m^{(n)}$, conditional on the information provided
by $X_1, \ldots, X_n$, has been determined in [42] and is given by

$$P(K_m^{(n)} = j \mid X_1, \ldots, X_n) = \frac{V_{n+m,k+j}}{V_{n,k}}\; \frac{\mathscr{C}(m, j; \sigma, -n + k\sigma)}{\sigma^j} \qquad (32)$$

where $X_1, \ldots, X_n$ is partitioned into $K_n = k$ clusters with respective frequencies $n_1, \ldots, n_k$ and
$\mathscr{C}(n, k; \sigma, \gamma)$ is the non-central generalized factorial coefficient

$$\mathscr{C}(m, j; \sigma, -n + k\sigma) = (j!)^{-1} \sum_{r=0}^{j} (-1)^r \binom{j}{r} (n - \sigma(r + k))_m.$$

See [5]. An important implication of (32) is the sufficiency of $K_n$ for predicting the number
of “new” distinct species. The expression in (32) then serves as a basis for determining the
Bayesian nonparametric estimator, with respect to a squared loss function, of the overall species
variety as

$$\hat K_m^{(n)} = E(K_m^{(n)} \mid K_n = k, \mathbf{N}_n = \mathbf{n}) \qquad (33)$$

with $\mathbf{n} = (n_1, \ldots, n_k)$. This can be seen as a Bayesian counterpart of the Good-Toulmin
estimator.

When $\tilde p$ is the PY process with parameters $(\sigma, \theta)$, (32) becomes

$$P(K_m^{(n)} = j \mid K_n = k, \mathbf{N}_n = \mathbf{n}) = \frac{(\theta/\sigma + k)_j}{(\theta + n)_m}\; \mathscr{C}(m, j; \sigma, -n + k\sigma). \qquad (34)$$

As shown in [16], the estimator for $K_m^{(n)}$ in (33) then reduces to

$$\hat K_m^{(n)} = \left(k + \frac{\theta}{\sigma}\right) \left\{ \frac{(\theta + n + \sigma)_m}{(\theta + n)_m} - 1 \right\}. \qquad (35)$$

The main advantage of (35), and of other estimators devised in [16] for measuring the overall
species variety, is that they are explicit and can be exactly evaluated even when the size $m$ of
the additional sample is large compared to the size $n$ of the basic sample. This happens, for
instance, in genomic applications where one has to deal with relevant portions of cDNA libraries
consisting of millions of genes.
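
The PY estimator of the number of new species can be evaluated stably on the log scale even for very large $m$. The sketch below assumes the closed form $(k + \theta/\sigma)\{(\theta+n+\sigma)_m/(\theta+n)_m - 1\}$ with $\sigma \in (0,1)$, $\theta > -\sigma$ and $n \ge 1$, and cross-checks it against the one-step predictive rule of the PY process; the function names and parameter values are illustrative.

```python
from math import exp, lgamma


def py_new_species_estimate(n, k, m, sigma, theta):
    """Closed-form posterior expectation of the number of new species in m draws.

    Uses (x)_m = Gamma(x + m) / Gamma(x) on the log scale.
    """
    log_ratio = (lgamma(theta + n + sigma + m) - lgamma(theta + n + sigma)
                 - lgamma(theta + n + m) + lgamma(theta + n))
    return (k + theta / sigma) * (exp(log_ratio) - 1.0)


def py_new_species_sequential(n, k, m, sigma, theta):
    """Same expectation obtained by iterating the one-step rule
    P(new species | j clusters, sample size s) = (theta + j * sigma) / (theta + s).
    The rule is linear in j, so expectations can be propagated directly."""
    expected_k = float(k)
    for step in range(m):
        expected_k += (theta + expected_k * sigma) / (theta + n + step)
    return expected_k - k


if __name__ == "__main__":
    # Hypothetical basic sample: n = 50 observations, k = 10 distinct species.
    print(py_new_species_estimate(50, 10, 200, 0.5, 1.0))
    print(py_new_species_sequential(50, 10, 200, 0.5, 1.0))
    print(py_new_species_estimate(50, 10, 10 ** 6, 0.5, 1.0))  # m much larger than n
```

Unlike the Good-Toulmin sums, this expression stays positive and increasing in $m$ for any size of the additional sample.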

In several applied contexts it is useful to accompany point estimates such as (35) with the
corresponding credible intervals. These can be easily derived from the conditional distribution
(34). However, if the sample sizes are very large the computation of the non-central generalized
factorial coefficient may become cumbersome. To circumvent such a problem one could resort to
asymptotic credible intervals. This motivates, also from a practical point of view, the asymptotic
analysis of $K_m^{(n)}$, conditional on $K_n$, for a fixed $n$ and as $m \to \infty$, provided in [16]. Let $f_\sigma$ stand
for the density function of a positive $\sigma$-stable random variable, and let $U_q$, for any $q \ge 0$, be a
positive random variable characterized by the following density function

$$f_{U_q}(u) = \frac{\Gamma(q\sigma + 1)}{\sigma\, \Gamma(q + 1)}\; u^{q - 1 - 1/\sigma}\, f_\sigma(u^{-1/\sigma}).$$

Moreover, let $B_{a,b}$ be a beta random variable with parameters $(a, b)$. Conditional on the
information provided by $X_1, \ldots, X_n$, one has

$$\frac{K_m^{(n)}}{m^{\sigma}} \longrightarrow Z_{n,k} \quad \text{a.s.} \qquad (36)$$

as $m \to \infty$, where $Z_{n,k} = B_{k+\theta/\sigma,\, n/\sigma - k}\, U_{(\theta+n)/\sigma}$ with $B_{k+\theta/\sigma,\, n/\sigma-k}$ and $U_{(\theta+n)/\sigma}$ being
independent. A similar asymptotic result has been obtained also for the NGG process in [17]. Note
that the $\sigma$-diversity discussed in (16) can be recovered from (36) by setting $n = k = 0$. Turning
back to the practical uses of (36) for the determination of asymptotic credible intervals for $K_m^{(n)}$,
it is apparent that one still needs to derive the quantiles of $Z_{n,k}$. From an analytical point of
view this is a challenging task, which nonetheless can be avoided by resorting to straightforward
computational algorithms that allow one to sample from the limit random variable $Z_{n,k}$, and thus
to approximate the quantiles. See [16, 9] for details.


4.2 Bayesian inference on rare species variety 

The problem of deriving estimators for the rare species variety has been recently considered
in [17] and [19]. One such estimator is represented by the number of distinct species with
frequencies less than or equal to a specified threshold of abundance $\tau$, for any $\tau \le n + m$, that
are generated by the additional sample, as displayed in (28). The determination of $\hat M_{i,m}^{(n)}$, under
a square loss function, is eased by resorting to the decomposition

$$\hat M_{i,m}^{(n)} = \hat N_{i,m}^{(n)} + \hat O_{i,m}^{(n)}$$

where $\hat N_{i,m}^{(n)}$ is the estimator of the number of “new” distinct species with frequency $i$ not detected
in $X_1, \ldots, X_n$ and $\hat O_{i,m}^{(n)}$ is the estimator of the number of “old” distinct species (i.e. included in
$X_1, \ldots, X_n$) that appear with frequency $i$ in the enlarged sample. This implies that $\hat M_m^{(n)}(\tau)$ in
(28) arises as the sum of two well-defined quantities: (i) the estimator of the number of “new”
distinct species with frequencies less than or equal to $\tau \le m$ and generated by the additional
sample, i.e. $\hat N_m^{(n)}(\tau) := \sum_{i=1}^{\tau} \hat N_{i,m}^{(n)}$; (ii) the estimator of the number of “old” distinct species with
frequencies less than or equal to $\tau \le n + m$ and generated by updating the frequencies of the
partition induced by the basic sample with the additional sample, i.e. $\hat O_m^{(n)}(\tau) := \sum_{i=1}^{\tau} \hat O_{i,m}^{(n)}$.
It is apparent that if $\tau = m$ one obtains $\hat N_m^{(n)}(\tau) = \hat K_m^{(n)}$. In this respect, the concept of rare
species variety can be interpreted as a generalization of the concept of overall species variety.

A result in [17] gives explicit expressions of the moments, of any order, of both the number
of “new” species with frequency $i$ in $X_{n+1}, \ldots, X_{n+m}$ and of the number of “old” species with
frequency $i$ in the enlarged sample $X_1, \ldots, X_{n+m}$. From these one deduces $\hat N_{i,m}^{(n)}$ and $\hat O_{i,m}^{(n)}$, thus
obtaining an estimator of rare species variety. It can be seen that

$$\hat O_{i,m}^{(n)} = \sum_{t=1}^{i} \cdots \qquad (37)$$

From (37), it is clear that $(K_n, M_{1,n}, \ldots, M_{\tau,n})$ is a sufficient statistic for predicting the number
of “old” distinct species with frequency less than or equal to $\tau$. Moreover,

$$\hat N_{i,m}^{(n)} = \binom{m}{i} (1 - \sigma)_{i-1} \sum_{j=0}^{m-i} \frac{V_{n+m,k+j+1}}{V_{n,k}}\; \frac{\mathscr{C}(m - i, j; \sigma, -n + k\sigma)}{\sigma^j} \qquad (38)$$

and $K_n$ is sufficient for predicting the number of “new” distinct species with frequency less than
or equal to $\tau$. Finally, $\hat M_{i,m}^{(n)}$ can be derived as the sum of the estimators in (37) and (38) and,
then, $\hat M_m^{(n)}(\tau)$ from (28).

If we focus on the special case where the Gibbs-type prior is the PY process, then the
expressions (37) and (38) considerably simplify and reduce to

$$\hat O_{i,m}^{(n)} = \sum_{t=1}^{i} \binom{m}{i-t} M_{t,n}\, (t - \sigma)_{i-t}\; \frac{(\theta + n - t + \sigma)_{m-(i-t)}}{(\theta + n)_m},$$

$$\hat N_{i,m}^{(n)} = \binom{m}{i} (1 - \sigma)_{i-1}\, (\theta + k\sigma)\; \frac{(\theta + n + \sigma)_{m-i}}{(\theta + n)_m}.$$

It is worth noting that the determination of estimators of the rare species variety poses a major
technical hurdle that does not occur when estimating the overall species variety. Indeed, one
has to consider all possible modifications, induced by the observations in the additional sample,
on the frequencies of the species detected in the basic sample.
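
A numerical sanity check of the PY estimators of “old” and “new” species with a given frequency is sketched below. It assumes the beta-binomial forms $\binom{m}{i-t} M_{t,n} (t-\sigma)_{i-t} (\theta+n-t+\sigma)_{m-(i-t)}/(\theta+n)_m$, summed over $t$, and $\binom{m}{i}(1-\sigma)_{i-1}(\theta+k\sigma)(\theta+n+\sigma)_{m-i}/(\theta+n)_m$; function names and the toy frequency counts are illustrative.

```python
from math import comb


def rising(a, n):
    """Ascending factorial (a)_n, with (a)_0 = 1."""
    out = 1.0
    for j in range(n):
        out *= a + j
    return out


def old_species_estimate(i, m, n, counts, sigma, theta):
    """Estimated number of old species with frequency i in the enlarged sample.

    counts maps each frequency t to M_{t,n} in the basic sample.
    """
    total = 0.0
    for t, m_t in counts.items():
        s = i - t  # additional observations the species must receive
        if 0 <= s <= m:
            total += (comb(m, s) * m_t * rising(t - sigma, s)
                      * rising(theta + n - t + sigma, m - s))
    return total / rising(theta + n, m)


def new_species_estimate(i, m, n, k, sigma, theta):
    """Estimated number of new species with frequency i generated by m further draws."""
    if i < 1 or i > m:
        return 0.0
    return (comb(m, i) * rising(1.0 - sigma, i - 1) * (theta + k * sigma)
            * rising(theta + n + sigma, m - i)) / rising(theta + n, m)


if __name__ == "__main__":
    counts = {1: 10, 2: 8, 3: 1}   # hypothetical basic sample: n = 29, k = 19
    n, k, m, sigma, theta = 29, 19, 5, 0.5, 1.0
    for i in range(1, 5):
        print(i, old_species_estimate(i, m, n, counts, sigma, theta),
              new_species_estimate(i, m, n, k, sigma, theta))
```

Two identities make useful checks: the estimated frequency counts of the enlarged sample must satisfy $\sum_i i\, \hat M_{i,m}^{(n)} = n + m$, and summing the “new” part over all $i \le m$ must return the estimator of the overall species variety.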

In the special PY process case, one can establish the asymptotic behavior of rare species
variety as $m \to \infty$. This is somehow in the spirit of (36) in the context of overall species variety.
In this case, as shown in [17], one has for any $i \ge 1$

$$\frac{M_{i,n+m}}{m^{\sigma}} \,\Big|\, (X_1, \ldots, X_n) \;\overset{d}{\longrightarrow}\; \frac{\sigma\,(1 - \sigma)_{i-1}}{i!}\; Z_{n,k}$$

as $m \to \infty$, where $Z_{n,k}$ is the limit random variable introduced in (36) and $\overset{d}{\longrightarrow}$ stands for
convergence in distribution. This implies that $K_n$ is asymptotically sufficient for predicting the
number of distinct species with frequency $i$ that are generated after observing the additional
sample, conditional on the information provided by the random partition of the basic sample.

Rare species variety can be further assessed locally in terms of discovery probabilities U n+m ^ 
as defined in (29). This leads to the proposal of Bayesian nonparametric counterparts to the 
Turing and the Good-Toulmin estimators that are recalled in (30) and in (31), respectively. If 
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one assumes a square loss function, then an estimator of U n ^ is 


Un,i — 


V r 


n-\-l,k 


Vr 


n,k 


(i - a) M it r 


( 39 ) 


for any i < n, while the discovery probability of a new species, i.e. i = 0, can be easily deduced 
from the predictive distribution (15) and is given by U n> o = ln+i,fc+i/hn,fc- Note that, unlike the 
Turing estimator, U n i depends on Mi^ n which seems to be more coherent with what intuition 
would suggest. If we now let m > 1 and j < n + m, an estimator of the discovery probability 
turns out to be 1 


U, 
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where 
When $i = 0$ and $m \ge 1$, this yields the following Bayesian analog of the Good-Toulmin estimator
for the probability of discovering a new species


\[ \hat U_{n+m,0} = \sum_{i=0}^{m} \frac{V_{n+m+1,k+i+1}}{V_{n,k}}\, \frac{\mathscr{C}(m,i;\sigma,-n+k\sigma)}{\sigma^{i}} \tag{41} \]


whereas there is no frequentist counterpart to (40) when both m and k are positive integers. 
From these closed form expressions one can deduce a further measure of rare species variety as

\[ \hat U_{n+m}(\tau) = \sum_{i=0}^{\tau} \hat U_{n+m,i}. \]

If one adopts a specification of the $V_{n,k}$'s yielding a PY process, nice and simple forms of the
estimators of the discovery probabilities and of rare species variety are obtained. For example,
the analog of the Turing estimator (30) reduces to

\[ \hat U_{n,i} = M_{i,n}\,\frac{i-\sigma}{\theta+n} \]

and the Bayesian counterpart (41) of the Good-Toulmin estimator coincides with

\[ \hat U_{n+m,0} = \frac{\theta+k\sigma}{\theta+n}\;\frac{(\theta+n+\sigma)_m}{(\theta+n+1)_m}. \tag{42} \]
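The two PY formulas above are simple enough to evaluate directly. The following is a minimal
sketch, not the authors' code: the function names are hypothetical, $(a)_m$ is taken to be the
ascending factorial $a(a+1)\cdots(a+m-1)$, and the frequency counts in the usage lines are made-up
illustrative values rather than data from the paper.

```python
from math import exp, lgamma


def rising_factorial_ratio(a, b, m):
    """Return (a)_m / (b)_m for a, b > 0, with (x)_m = x (x+1) ... (x+m-1)."""
    # Work on the log scale so that large m does not overflow.
    return exp(lgamma(a + m) - lgamma(a) - lgamma(b + m) + lgamma(b))


def py_discovery_known(i, m_i, n, sigma, theta):
    """PY analog of the Turing estimator: M_{i,n} (i - sigma) / (theta + n)."""
    return m_i * (i - sigma) / (theta + n)


def py_discovery_new(n, k, m, sigma, theta):
    """PY analog of the Good-Toulmin estimator for a new species at step n+m+1."""
    return (theta + k * sigma) / (theta + n) * rising_factorial_ratio(
        theta + n + sigma, theta + n + 1, m
    )


# Hypothetical frequency counts: M_{1,n} = 3, M_{2,n} = 1, M_{5,n} = 1.
counts = {1: 3, 2: 1, 5: 1}
n = sum(i * c for i, c in counts.items())  # sample size
k = sum(counts.values())                   # number of distinct species
sigma, theta = 0.5, 1.0                    # illustrative parameter values

for i, m_i in counts.items():
    print("frequency", i, py_discovery_known(i, m_i, n, sigma, theta))
print("new species, m = 0:", py_discovery_new(n, k, 0, sigma, theta))
```

With $m = 0$ the probabilities of re-observing a species of each frequency and of observing a new
one add up to one, which gives a quick sanity check on any implementation.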

1 The estimators in (40) and (43) slightly differ from those in [19], since the latter contain a minor inaccuracy 
that we have corrected here. 
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631 
Table 2: ESTs from two Naegleria gruberi libraries. Reported data include: frequency counts
$M_{i,n}$, for different values of $i$, total number of distinct genes $j$ and sample size $n$. Source: Susko
and Roger (2004).


Finally, the probability that the $(n + m + 1)$-th observation coincides with a species detected $j$
times in the enlarged sample $X_1,\ldots,X_{n+m}$ is
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(43) 

To briefly illustrate the behavior of the Bayesian nonparametric estimator based on the PY
process and compare it with the Good-Toulmin estimator, let us consider genomic data
consisting of Expressed Sequence Tags (ESTs) obtained from Naegleria gruberi cDNA libraries.
Naegleria gruberi is a widespread free-living soil and freshwater amoeboflagellate widely studied
in the biological literature. The two datasets considered are sequenced from two cDNA libraries
prepared from cells grown under different culture conditions, aerobic and anaerobic, and have
been previously analyzed in [72, 44]. The sequenced data, which will constitute the basic samples,
are reported in Table 2.

If one is interested in the probability of discovering a new gene at the $(n + m + 1)$-th step
of the sequencing process, one has two options: the Good-Toulmin estimator of $U_{n+m,0}$ reported
in (31) or the estimator $\hat U_{n+m,0}$ in (42), which is based on the PY process. To complete the
specification of the latter, let us mention that the parameters $(\sigma, \theta)$ are fixed according to an
empirical Bayes specification, which yields (0.66, 155.5). The results are displayed in Figure 5. It
is clear that the Good-Toulmin estimator exhibits an erratic behavior when the size $m$ of the
additional sample is large relative to the size $n$ of the basic sample. This phenomenon is avoided
by the Bayesian nonparametric estimator since it relies on a well-defined probabilistic model in which
all quantities are modeled jointly and coherently. For sizes of $m$ for which the Good-Toulmin
estimator works well, the estimators essentially coincide. Note that, in terms of the specific
application, the anaerobic library exhibits clearly higher genetic diversity. Furthermore, as
already mentioned, one can use such estimates to fix the size of the additional sample $m$ as the
maximum integer for which the discovery probability lies above the desired threshold, which is
typically determined also on the basis of cost considerations.
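The rule for fixing the size of the additional sample can be sketched as follows, using the PY
estimator of the probability of discovering a new species. This is an illustrative sketch under
stated assumptions: the function names, the cap `m_max`, the threshold and the values of $n$ and
$k$ in the usage line are hypothetical and are not taken from the Naegleria gruberi data.

```python
from math import exp, lgamma


def py_new_species_prob(n, k, m, sigma, theta):
    """PY estimate of the probability that observation n+m+1 is a new species."""
    log_ratio = (
        lgamma(theta + n + sigma + m) - lgamma(theta + n + sigma)
        - lgamma(theta + n + 1 + m) + lgamma(theta + n + 1)
    )
    return (theta + k * sigma) / (theta + n) * exp(log_ratio)


def largest_additional_sample(n, k, sigma, theta, threshold, m_max=100_000):
    """Largest m in [0, m_max] whose estimate is >= threshold, or None if there is none."""
    best = None
    for m in range(m_max + 1):
        if py_new_species_prob(n, k, m, sigma, theta) >= threshold:
            best = m
        else:
            # For sigma < 1 the estimate is decreasing in m, so we can stop here.
            break
    return best


# Hypothetical basic sample: n = 100 observations with k = 40 distinct species.
print(largest_additional_sample(n=100, k=40, sigma=0.5, theta=10.0, threshold=0.2))
```

A linear scan is enough here because each extra observation multiplies the estimate by a factor
smaller than one; a bisection search would work equally well for very large `m_max`.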


Figure 5: EST data from Naegleria gruberi aerobic and anaerobic cDNA libraries with basic
sample n = 950: Good-Toulmin (GT) and Pitman-Yor process (PY) estimators of the probability
of discovering a new gene at the (n + m + 1)-th sampling step for m = 1,..., 2000.


5 Frequentist asymptotics 

During the last two decades frequentist consistency has gained a major role in Bayesian
nonparametrics and is generally accepted as a key validation criterion for the use of a nonparametric
prior in applied problems. See [25] for a recent review on the subject. The idea that underlies
the study of consistency consists in assuming that the data are iid from some “true”
$P_0 \in \mathcal{P}_{\mathbb{X}}$ and in verifying whether the posterior distribution $Q(\,\cdot \mid X_1,\ldots,X_n)$ accumulates in
any suitably defined neighborhood of $P_0$. Therefore, while the posterior is derived based on an
assumption of exchangeability of the data as described in (1), the frequentist asymptotic
evaluation postulates plain independence of the data generating process. This explains why such
an approach has also been termed the “what if” approach by P. Diaconis. See [10]. Here we shall
discuss consistency for Gibbs-type priors. In this respect, note that frequentist asymptotics of
Bayesian procedures is different from the kind of asymptotics discussed in Section 4, which
preserves a Bayesian flavor since it aims at achieving a large sample approximation of the posterior
without modifying the dependence assumption among the data.

Let us start by fixing some notation and introducing some useful concepts. First, the data
$X_i$'s are assumed to be iid from some “true” $P_0$ or, in other terms, the distribution of the
sequence of observations $(X_n)_{n \ge 1}$ is the infinite product measure $P_0^\infty = P_0 \times P_0 \times \cdots$. If
$A_\varepsilon$ denotes a neighborhood of $P_0$ of radius $\varepsilon$, the posterior is said to be consistent at $P_0$ if
$Q(A_\varepsilon \mid X_1,\ldots,X_n) \to 1$ almost surely with respect to $P_0^\infty$, as $n \to \infty$ and for any $\varepsilon > 0$. In
the case of Gibbs-type priors, the natural choice for $A_\varepsilon$ is represented by weak neighborhoods.
Clearly, consistency can be achieved only at those $P_0$ whose weak neighborhoods have a priori positive
probability. In this respect, the full support property of Gibbs-type priors recalled in Section 2.2
is important since it ensures that consistency can potentially be achieved at any given $P_0$.
Furthermore, note that the full support property represents a desirable property not only when
studying consistency in the case where Gibbs-type priors are used to model the data directly,
but also in the context of mixture models as in (12). Indeed, together with some other features
of Gibbs-type priors, it allows one to extend known consistency results for Dirichlet process mixture
models in a straightforward way, and the conditions for them to hold will be essentially the same.
See [26, 46].

As explained in some detail below, recent results suggest that Gibbs-type priors are always
consistent with respect to (w.r.t.) discrete $P_0$'s. On the other hand, when they are used to
model data coming from diffuse distributions, inconsistency may arise. Possible inconsistency
at diffuse $P_0$ should not, however, be interpreted as a serious issue: what really matters is the
data generating mechanism the nonparametric prior is designed for, so that consistency must
hold w.r.t. choices of $P_0$ that are compatible with such a mechanism. Since Gibbs-type priors
are discrete random probability measures, one should be primarily interested in investigating
consistency w.r.t. discrete $P_0$'s. Indeed, Gibbs-type priors, and discrete nonparametric priors in
general, are designed to model discrete distributions and they should under no circumstance be
used to model directly data coming from diffuse distributions. In the latter case they should be
exploited within hierarchical mixtures.

5.1 General results 

The strategy for showing consistency consists in first identifying the weak limit of the posterior,
say $P'$ in $\mathcal{P}_{\mathbb{X}}$, which will be some function of $P_0$, and then checking whether $P' = P_0$, so that
consistency is achieved. The candidate weak limit $P'$ is identified by investigating the asymptotic
behavior of the predictive distribution (15) (i.e. the posterior expected value), which in explicit
cases allows one to guess $P'$ quite easily. Then one has to show that the posterior variance of $p$
in (3) converges to 0, a.s.-$P_0^\infty$, which suffices to establish that the posterior concentrates in a
weak neighborhood of the predictive distribution. See [34, 8] for details. Now, let $X_1,\ldots,X_n$
denote a sample with $\kappa_n$ distinct values with corresponding frequencies $n_1,\ldots,n_{\kappa_n}$. Even if $\kappa_n$
denotes the same quantity identified as $K_n$ in previous sections, we shall use a different symbol
to emphasize the fact that here the asymptotic behavior of the number of observed distinct
species is dictated by $P_0$, from which the iid sequence is sampled, and not by a Gibbs-type
prior directing an exchangeable sequence according to (1). Different choices of $P_0$ clearly yield
different (almost sure) limiting behaviors for $\kappa_n$. On the one hand, if $P_0$ is discrete with $N$ point
masses, for any $N \in \mathbb{N} \cup \{\infty\}$, then $P_0^\infty(\lim_n \kappa_n = N) = 1$ and $P_0^\infty(\lim_n n^{-1}\kappa_n = 0) = 1$ even if
$N = \infty$. On the other hand, if $P_0$ is diffuse, $P_0^\infty(\kappa_n = n) = 1$ for any $n \ge 1$. Henceforth we shall
focus on these two cases and adopt the shorter notation $\kappa_n \ll_{a.s.} n$ and $\kappa_n \sim_{a.s.} n$, which stand
for $\kappa_n/n \to 0$ and $\kappa_n/n \to 1$ a.s.-$P_0^\infty$, respectively. It turns out that a key quantity for studying
the asymptotics of the predictive distribution is given by the probability (4) of discovering a
new value at the $(n + 1)$-th sampling step, which is given by $V_{n+1,\kappa_n+1}/V_{n,\kappa_n}$ in the case
of Gibbs-type priors. Considering a Gibbs-type prior with base measure $P^*$ having support $\mathbb{X}$
and assuming that

\[ \frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}} \to \alpha \qquad \text{a.s.-}P_0^\infty \tag{H} \]

as $n \to \infty$ for some constant $\alpha \in [0,1]$, in [8] it is shown that

\[ Q(A'_\varepsilon \mid X_1,\ldots,X_n) \to 1 \qquad \text{a.s.-}P_0^\infty \]

as $n \to \infty$ and for any $\varepsilon > 0$, where $A'_\varepsilon$ is a weak neighborhood of $P'$. Moreover, one has

\[ P'(\,\cdot\,) = \alpha P^*(\,\cdot\,) + (1-\alpha) P_0(\,\cdot\,). \tag{44} \]


Some comments regarding the above convergence result are in order. As for the condition (H),
it is worth noting that it holds true for all Gibbs-type priors for which an explicit expression of
the $V_{n,\kappa_n}$'s is known, regardless of whether $P_0$ is discrete or diffuse. It therefore represents
only a mild regularity condition. Moreover, the posterior distribution converges to a point mass
at (44), a linear combination of the prior guess $P^*$ and the “true” distribution $P_0$. Hence, weak
consistency is guaranteed if $\alpha = 0$ (and in the trivial case $P^* = P_0$, to be excluded henceforth)
and it is sufficient to check whether the probability of discovering a new value converges to 0,
a.s.-$P_0^\infty$. Also, one can assess the departure from consistency by looking at the size of $\alpha$: the
larger $\alpha$, the heavier the limiting mass assigned to the prior guess $P^*$. One can even think of
a case of “total inconsistency”, i.e. $\alpha = 1$, the worst case scenario where the posterior tends to
concentrate around the prior guess $P^*$ and no learning at all takes place.

To better visualize the above convergence result it is useful to look at special cases of the PY
process with $\sigma \in [0,1)$ and $\theta > -\sigma$, for which such convergence had already been established in
[34]. From the form of their predictive distributions (11), one can immediately conjecture the
following result: when $P_0$ is discrete ($\kappa_n \ll_{a.s.} n$) we have $\alpha = 0$, implying consistency; when $P_0$
is diffuse ($\kappa_n \sim_{a.s.} n$), we have $\alpha = \sigma$, hence inconsistency, unless $\sigma = 0$, which corresponds to
the Dirichlet case. See also [36]. An analogous result has been established for the NGG process,
together with some results concerning the case of Gibbs-type priors with $\sigma > 0$, in [36].
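The conjecture for the PY process can be checked numerically. The sketch below assumes the
standard PY probability of a new value, $(\theta + \sigma\kappa_n)/(\theta + n)$, and compares the diffuse regime
($\kappa_n = n$) with a discrete regime in which the number of distinct values is capped; the cap of 50
and the parameter values are arbitrary choices made only for illustration.

```python
def py_new_value_prob(n, kappa_n, sigma, theta):
    """PY probability that observation n+1 is new, given kappa_n distinct values so far."""
    return (theta + sigma * kappa_n) / (theta + n)


sigma, theta = 0.5, 1.0  # illustrative parameter values
for n in (10**2, 10**4, 10**6):
    diffuse = py_new_value_prob(n, n, sigma, theta)            # kappa_n = n
    discrete = py_new_value_prob(n, min(n, 50), sigma, theta)  # kappa_n stays bounded
    print(n, round(diffuse, 6), round(discrete, 6))
```

As $n$ grows the first column of probabilities approaches $\sigma$ and the second approaches 0, in
agreement with $\alpha = \sigma$ for diffuse $P_0$ and $\alpha = 0$ for discrete $P_0$.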

Focusing now on Gibbs-type priors with $\sigma < 0$ allows one to highlight the occurrence of
interesting phenomena. Recall from Section 2.2 that these priors coincide with mixtures of PY
processes with parameters $\{(\sigma, m|\sigma|) : m = 1, 2, \ldots\}$ and that they can be represented in
hierarchical form as (20). It turns out that, according to the nature of the “true” distribution $P_0$,
a sufficient condition for consistency can be stated in terms of the tail behavior of the mixing
distribution $\pi$ in (20). More precisely, for Gibbs-type priors with parameter $\sigma < 0$ and prior guess
$P^*$ whose support coincides with $\mathbb{X}$, in [8] consistency is shown to hold

(i) at any discrete $P_0$ if, for sufficiently large $m$,

\[ \frac{\pi(m+1)}{\pi(m)} \le 1; \tag{T1} \]

(ii) at any diffuse $P_0$ if, for sufficiently large $m$ and for some $M < \infty$,

\[ \frac{\pi(m+1)}{\pi(m)} \le \frac{M}{m}. \tag{T2} \]

Condition (T1) is an extremely mild assumption on the regularity of the tail of the mixing
$\pi$: it requires $x \mapsto \pi(x)$ to be ultimately decreasing, a condition met by the commonly used
probability measures on $\mathbb{N}$. Hence one can conclude that Gibbs-type priors with parameter
$\sigma < 0$ are essentially always consistent when $P_0$ is discrete. On the other hand, condition (T2)
requires the tail of $\pi$ to be sufficiently light, so when $P_0$ is diffuse one needs to closely investigate
the tail behavior of $\pi$.


5.2 Illustrations 

In light of the results stated above one is naturally led to wonder what happens when (T2) is
not satisfied. To this end we consider three different Gibbs-type priors presented in Section 2.2
with $\sigma = -1$: each prior is characterized by a specific choice of the mixing distribution $\pi$.
We focus on the case of diffuse $P_0$, which leads to some interesting conclusions. In the case of
discrete $P_0$ it is straightforward to show that (T1) holds, hence ensuring consistency.

The first prior we consider, introduced in [27], is characterized by the heavy-tailed mixing
distribution (21), which does not admit a finite expected value. Since $\pi(m+1)/\pi(m) =
(m-\gamma)/(m+1)$ cannot be eventually bounded by $M/m$ for some constant $M$, condition (T2) does
not hold true. Given that the $V_{n,\kappa_n}$'s admit the simple closed form expression (22), the weights of
the prediction rule simplify to

\[ \frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}} = \frac{\kappa_n(\kappa_n-\gamma)}{n(\gamma+n)}. \]

It is easy to see that, if $P_0$ is diffuse, implying $\kappa_n \sim_{a.s.} n$, condition (H) holds true with $\alpha = 1$
and the weak limit coincides with the prior guess $P^*$, whatever the “true” distribution of the
data $P_0$. This means we are in the case of “total” inconsistency.

The second example has a Poisson mixing distribution (23) on the positive integers. Such a
$\pi$ has light tails and condition (T2) is satisfied since $\pi(m+1)/\pi(m) = \lambda/(m+1)$. Therefore,
by (T2), the posterior is consistent when $P_0$ is diffuse.

The last sub-family of Gibbs-type priors with $\sigma = -1$ is identified by a geometric mixing
distribution (24). Note that $\pi(m+1)/\pi(m) = \eta$, so that condition (T2) does not hold true. It
turns out that, with $P_0$ diffuse and $\kappa_n \sim_{a.s.} n$, one obtains

\[ \frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}} \to \alpha = \frac{2-\eta-2\sqrt{1-\eta}}{\eta} \in [0,1]. \tag{45} \]

See [8] for details. The limit $\alpha$ in (45) can be any point in $[0,1]$ according to the value of $\eta$ and
therefore we can obtain the whole spectrum of weak limits (44), ranging from consistency ($\alpha = 0$)
to “total” inconsistency ($\alpha = 1$). In particular, $\alpha$ is increasing in $\eta$, so the larger $\eta$, the heavier
the limiting mass assigned to the prior guess. Small values of $\eta$ identify a situation similar to
the second example since they yield a light-tailed $\pi$. Conversely, large values of $\eta$ are more in
line with the first example, giving rise to a heavy-tailed $\pi$. Finally, it is worth remarking that a
minimal deviation from condition (T2) already produces inconsistent behaviors, even extreme
ones, showing that (T2) is close to being necessary.
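The three illustrations can be compared numerically. The sketch below writes $\gamma$, $\lambda$ and $\eta$ for
the parameters of the mixing distributions (21), (23) and (24), and uses the ratios
$\pi(m+1)/\pi(m)$ quoted in the text. It tabulates $m\,\pi(m+1)/\pi(m)$, which (T2) requires to stay
bounded, and evaluates the limit in (45). The parameter values and function names are
hypothetical choices made only for illustration.

```python
from math import sqrt


def t2_diagnostic(ratio, ms):
    """Return m * pi(m+1)/pi(m) on a grid of m; (T2) asks this to stay bounded."""
    return [m * ratio(m) for m in ms]


gamma_, lam, eta = 0.5, 3.0, 0.4  # hypothetical parameter values
ratios = {
    "heavy-tailed (21)": lambda m: (m - gamma_) / (m + 1),
    "Poisson (23)": lambda m: lam / (m + 1),
    "geometric (24)": lambda m: eta,
}
grid = [10, 100, 1000, 10000]
for name, ratio in ratios.items():
    print(name, [round(v, 3) for v in t2_diagnostic(ratio, grid)])


def alpha_geometric(eta):
    """Limit alpha in (45) for the geometric mixing distribution, eta in (0, 1]."""
    return (2.0 - eta - 2.0 * sqrt(1.0 - eta)) / eta


for e in (0.01, 0.25, 0.5, 0.75, 1.0):
    print(e, round(alpha_geometric(e), 4))
```

Only the Poisson ratio keeps the diagnostic bounded (by $\lambda$), while for the other two it grows
linearly in $m$. The second loop shows how $\alpha$ moves from near 0 to 1 as $\eta$ increases.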


6 Dependent processes for Gibbs-type priors

In this section we briefly discuss possible extensions of the previous results to a dynamic
setting. In particular, here we refer to time-indexed random objects, with some specification of
the temporal transition mechanism, whose stationary, or at least marginal, states coincide in
distribution with some random probability measure of Gibbs-type. In this respect, it is
important to distinguish between two different research areas on time-dependent random probability
measures, both related to Bayesian nonparametric priors. The main difference between these
two approaches, outlined below, lies in the fact that the first is mostly driven by inferential
purposes, while the second is more concerned with the analytical properties of the constructed
objects. While the first is closer to the interests of the Bayesian community, it is our opinion
that the two approaches have a strong potential for benefitting from one another.

The first area, concerned with so-called dependent processes, is at present an extremely
active front in Bayesian Nonparametrics. Besides the pioneering contributions in [6], the modern
approaches to the problem can be traced back to [52]. Generally speaking, the aim is to
investigate generalizations of the Dirichlet process (or other random measures) to frameworks which
allow for types of dependence less restrictive than exchangeability. These include for example
dependence on time or, more generally, on covariates. See, for example, [1] for some up-to-date
references. Most contributions in this direction exploit the representation (3), and dependence
is quite easily induced via the weights and/or the atoms. Moreover, this allows one to exploit
simulation techniques such as the slice sampler ([75], [7]) and the retrospective sampler [56]. The
combination of these two main factors then leads to efficient inferential procedures in such
non-exchangeable frameworks.

The second research area has its roots in Applied Probability and is concerned with stochastic
population dynamics, but is also closely related to Bayesian nonparametric modeling. The main
idea underlying the constructions in this framework is that of approximating the dynamics
of a large population with a diffusion process, where the process dimension depends on the
number of species the population is allowed to have. When the species can be of infinitely
many types, this gives rise to infinite-dimensional or measure-valued diffusions. In some cases
the individual reproduction mechanisms yield populations whose frequencies have marginal or
stationary states such as the one- and two-parameter Poisson-Dirichlet distribution ([13], [58]),
the Dirichlet process ([14]) and the normalized-inverse Gaussian distribution ([70]). From a Bayesian
perspective these clearly represent dependent priors. At least in the authors’ opinion, such an
approach represents a highly promising research line for the definition of dependent processes,
since the possibility of studying their analytical properties also yields a deeper understanding of
their behavior. Other reasons of interest for the Bayesian community include the use of Pólya
urn schemes for constructing some of these dependent random probability measures ([69], [64];
see also [4]), and the investigation of the so-called $\sigma$-diversity processes (in the notation of
Section 2). These constitute a dynamic counterpart of (16), and make explicit the dynamics
and distributional properties concerning the evolution of the clustering structure within the
population, as a consequence of the specific modeling dynamics at hand. See [70] and [68].
7 Concluding remarks


An intense research activity, started after the introduction of the Dirichlet process, has produced
a vast literature concerning classes of random probability measures whose laws can be used as
nonparametric priors. In current research the choice among these classes is often dictated by
taste (one’s “favorite prior”), mathematical tractability or a blend of the two. For instance,
neutral to the right priors ([11]) are typically used in survival analysis contexts since they are
conjugate also w.r.t. right censored observations. However, there is no conceptual reason to prefer
a conjugate prior over a non-conjugate one and it all boils down to mathematical convenience,
since conjugacy allows one to evaluate posterior inferences of interest. With Gibbs-type priors things go
the opposite way: one makes a precise assumption on the learning mechanism, according to
which the prediction of a new value depends on the sample size $n$ and on the number of distinct
values observed so far $K_n = k$ but not on their frequencies $n_1,\ldots,n_k$, and only afterwards
investigates the implications of such an assumption. This is very much in the spirit of de Finetti
himself, who constantly emphasizes in his works the importance of formulating assumptions on
empirically “observable” rather than on “unobservable” quantities. In this respect Gibbs-type
priors can be seen as counterparts to characterizations of parametric families in terms
of exchangeability and some other characteristic of the observables. Consider, for instance,
Freedman’s characterization [21] of exchangeable and rotationally invariant sequences as mixtures
of Gaussians: it is the requirement of rotational invariance on the observables that justifies the use
of Gaussian distributions. In a nonparametric context, an analogous type of result (see [66, 50])
legitimates the use of the Dirichlet process: by assuming exchangeability and a prediction rule
given by a linear combination of the prior guess and the empirical measure one automatically
obtains the Dirichlet process.

Turning back to the Gibbs case, once the assumption on the learning mechanism is made,
one realizes that the high degree of mathematical tractability is nothing but an implication and
not a motivation. This then allows one to work out a wealth of results concerning the behavior of
Gibbs-type priors. Importantly, one is no longer constrained to a logarithmic increase of $K_n$
as in the Dirichlet case, and the whole spectrum going from a finite $K_n$ to an almost linearly
increasing $K_n$ is available. This, in turn, produces a significantly more flexible prior on the
number of components in mixture models. Furthermore, distributional properties and (often)
closed form expressions for estimators of the quantities of statistical interest can be derived. An
appealing feature is also represented by the fact that such quantities retain an intuitive flavor
by directly relating to the key learning assumption. For instance, in the context of species
sampling, one coherently has that $K_n$ is a sufficient statistic for predictions concerning “new”
values. In contrast, if predictions are required for both “new” values and already observed values
with frequency less than or equal to $\tau$, the sufficient statistic becomes $(K_n, M_{1,n},\ldots,M_{\tau,n})$,
which includes species with frequency not larger than $\tau$. Although more subtle, this is also
in accordance with the key learning assumption and the implied reinforcement mechanism
described in the paper. Moreover, given the sound assumption on the learning scheme and
its persuasive implications, it seems natural to use the Gibbs framework also as a basis for the
definition of dependent processes.

Summing up, with this review we hope to have provided an affirmative and convincing
answer to the question posed in the title of the paper. We are also confident that the future
will see more statistical problems laid out in the well grounded general Gibbs-type framework.
This would bring a solid foundation to the story and obviously would not prevent one from using
one’s favorite Gibbs-type prior (e.g. the PY process) in the concrete application or even, if dropping
the dependence on $K_n$ is legitimated by the problem at issue, returning to the “safe” Dirichlet
world.
Appendix

Proof of Proposition 1 

Recall that for any species sampling model the probability of generating a new value is of the
form (9). For it not to depend on $(n_1,\ldots,n_k)$ for any $n \ge 1$ and $k \le n$, the EPPF $p_k^{(n)}$ necessarily
has to be of product form. It is shown in [28] that an EPPF associated to an infinite exchangeable
random partition is of product form if and only if it is given by (13) or, equivalently, if $p$ is of
Gibbs-type. Therefore (9) depends only on $n$ and $k$ if and only if $p$ is of Gibbs-type. This
proves the categorization of species sampling models $p$ in classes (ii) and (iii).

We are now left with showing that the Dirichlet process is the only species sampling model
for which (9) depends neither on the frequencies nor on $k$. Given the above, this amounts to
showing that the subclass (i) of the family of Gibbs-type priors (ii) contains only the Dirichlet
process.

First we show by a contradiction argument that for (9) not to depend on $k$ it must necessarily
be $\sigma = 0$. Then we conclude that the only Gibbs-type prior with $\sigma = 0$ for which (9) does not
depend on $k$ is the Dirichlet process. Confining ourselves to the Gibbs-type case (since in all
other cases (9) even depends on the frequencies), one has

\[ P(X_{n+1} = \text{“new”} \mid X_1,\ldots,X_n) = \frac{V_{n+1,k+1}}{V_{n,k}} = 1 - (n-\sigma k)\,\frac{V_{n+1,k}}{V_{n,k}} \]

and we assume it does not depend on $k$. This amounts to requiring that $(n-\sigma k)\,V_{n+1,k}\,(V_{n,k})^{-1}$
does not depend on $k$, namely

\[ \frac{V_{n+1,k}}{V_{n,k}} = \frac{c_n}{n-\sigma k} \tag{46} \]

for some $c_n$ not depending on $k$ and, by using (14),

\[ \frac{V_{n+1,k+1}}{V_{n,k}} = 1 - c_n. \tag{47} \]

The combination of (46) and (47) implies

\[ P(X_{n+1} \in A \mid X_1,\ldots,X_n) = (1-c_n)\,P_0(A) + \frac{c_n}{n-\sigma k}\sum_{j=1}^{k}(n_j-\sigma)\,\delta_{X_j^*}(A). \tag{48} \]

However, this is a prediction rule corresponding to an infinite exchangeable sequence if and only
if $\sigma = 0$. To see this, note that in view of [24, Proposition 3.2] infinite exchangeability requires
(48) to satisfy

\[ P(X_{n+1} \in A,\, X_{n+2} \in B \mid X_1,\ldots,X_n) = P(X_{n+1} \in B,\, X_{n+2} \in A \mid X_1,\ldots,X_n) \tag{49} \]

for any $n \ge 1$ and $A, B$ in $\mathscr{X}$. Consider, now, two sets $A$ and $B$ such that $A \cap B = \emptyset$,
$A \cap \{X_1,\ldots,X_n\} = \emptyset$ and $B \cap \{X_1,\ldots,X_n\} \ne \emptyset$. Hence, the left-hand side of (49) coincides
with

\[ (1-c_n)\,P_0(A)\,\Big\{ (1-c_{n+1})\,P_0(B) + \frac{c_{n+1}}{n+1-(k+1)\sigma}\sum_{j=1}^{k}(n_j-\sigma)\,\delta_{X_j^*}(B) \Big\} \]

whereas the right-hand side of (49) coincides with

\[ (1-c_{n+1})\,P_0(A)\,\Big\{ (1-c_n)\,P_0(B) + \frac{c_n}{n-k\sigma}\sum_{j=1}^{k}(n_j-\sigma)\,\delta_{X_j^*}(B) \Big\} \]

and the two are equal if and only if, for any $k = 1,\ldots,n$, one has

\[ \frac{c_{n+1}(1-c_n)}{n+1-(k+1)\sigma} = \frac{c_n(1-c_{n+1})}{n-k\sigma}. \]

Assuming $n \ge 2$, with $k = 1$ the above condition becomes

\[ \frac{c_{n+1}(1-c_n)}{n+1-2\sigma} = \frac{c_n(1-c_{n+1})}{n-\sigma} \tag{50} \]

and with $k = 2$, one has

\[ \frac{c_{n+1}(1-c_n)}{n+1-3\sigma} = \frac{c_n(1-c_{n+1})}{n-2\sigma}. \tag{51} \]

Taking the ratios of the terms in (50) and those in (51) yields

\[ \frac{n-2\sigma}{n-\sigma} = \frac{n+1-3\sigma}{n+1-2\sigma} \]

and this holds true if and only if $\sigma^2 = \sigma$, that is, since $\sigma < 1$, if and only if $\sigma = 0$. This
therefore contradicts the assumption that
$P(X_{n+1} = \text{“new”} \mid X_1,\ldots,X_n)$ does not depend on $k$ for $\sigma \ne 0$. A different proof can also be
derived by using the recursion (14) iterated over two prediction steps.
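The last equivalence can be verified by cross-multiplying the two ratios. The following worked
step is a sketch, writing $\sigma$ for the parameter of the Gibbs-type prior and using that, for
$\sigma < 1$ and $n \ge 2$, all denominators involved are strictly positive:

```latex
\begin{align*}
\frac{n-2\sigma}{n-\sigma} = \frac{n+1-3\sigma}{n+1-2\sigma}
  &\iff (n-2\sigma)(n+1-2\sigma) = (n-\sigma)(n+1-3\sigma) \\
  &\iff n(n+1) - 4\sigma n - 2\sigma + 4\sigma^2 = n(n+1) - 4\sigma n - \sigma + 3\sigma^2 \\
  &\iff \sigma^2 = \sigma .
\end{align*}
```

Since $\sigma < 1$, the only admissible solution is $\sigma = 0$.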

Finally, recall that [28, Theorem 13] shows that Gibbs-type priors with $\sigma = 0$ correspond
to the Dirichlet process or to Dirichlet process mixtures over its total mass parameter. On the
other hand, when $\sigma = 0$, (48) characterizes the Dirichlet process (see [50], [66]). Hence, Dirichlet
process mixtures over the total mass cannot belong to class (i), i.e. $P(X_{n+1} = \text{“new”} \mid X_1,\ldots,X_n)$
must depend also on $k$. The proof is, then, complete. □
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